[PATCH] mm, kasan: don't poison boot memory

Mon Feb 22 10:13:24 EST 2021

On 2/22/2021 4:52 AM, David Hildenbrand wrote:
> On 20.02.21 00:04, George Kennedy wrote:
>>
>>
>> On 2/19/2021 11:45 AM, George Kennedy wrote:
>>>
>>>
>>> On 2/18/2021 7:09 PM, Andrey Konovalov wrote:
>>>> On Fri, Feb 19, 2021 at 1:06 AM George Kennedy
>>>> <george.kennedy at oracle.com> wrote:
>>>>>
>>>>>
>>>>> On 2/18/2021 3:55 AM, David Hildenbrand wrote:
>>>>>> On 17.02.21 21:56, Andrey Konovalov wrote:
>>>>>>> During boot, all non-reserved memblock memory is exposed to the buddy
>>>>>>> allocator. Poisoning all that memory with KASAN lengthens boot time,
>>>>>>> especially on systems with large amounts of RAM. This patch makes
>>>>>>> page_alloc not call kasan_free_pages() on all new memory.
>>>>>>>
>>>>>>> __free_pages_core() is used when exposing fresh memory during system
>>>>>>> boot and when onlining memory during hotplug. This patch adds a new
>>>>>>> FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() through
>>>>>>> free_pages_prepare() from __free_pages_core().
>>>>>>>
>>>>>>> This has little impact on KASAN memory tracking.
>>>>>>>
>>>>>>> Assuming that there are no references to newly exposed pages before they
>>>>>>> are ever allocated, there won't be any intended (but buggy) accesses to
>>>>>>> that memory that KASAN would normally detect.
>>>>>>>
>>>>>>> However, with this patch, KASAN stops detecting wild and large
>>>>>>> out-of-bounds accesses that happen to land on a fresh memory page that
>>>>>>> was never allocated. This is taken as an acceptable trade-off.
>>>>>>>
>>>>>>> All memory allocated normally when the boot is over keeps getting
>>>>>>> poisoned as usual.
>>>>>>>
>>>>>>> Signed-off-by: Andrey Konovalov <andreyknvl at google.com>
>>>>>>> Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d
>>>>>> Not sure this is the right thing to do, see
>>>>>>
>>>>>> https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@oracle.com 
>>>>>>
>>>>>>
>>>>>>
>>>>>> Reversing the order in which memory gets allocated + used during boot
>>>>>> (in a patch by me) might have revealed an invalid memory access during
>>>>>> boot.
>>>>>>
>>>>>> I suspect that that issue would no longer get detected with your
>>>>>> patch, as the invalid memory access would simply not get detected.
>>>>>> Now, I cannot prove that :)
>>>>> Since David's patch we're having trouble with the iBFT ACPI table, which
>>>>> is mapped in via kmap() - see acpi_map() in "drivers/acpi/osl.c". KASAN
>>>>> detects that it is being used after free when ibft_init() accesses the
>>>>> iBFT table, but as of yet we can't find where it gets freed (we've
>>>>> instrumented calls to kunmap()).
>>>> Maybe it doesn't get freed, but what you see is a wild or a large
>>>> out-of-bounds access. Since KASAN marks all memory as freed during the
>>>> memblock->page_alloc transition, such bugs can manifest as
>>>> use-after-frees.
>>>
>>> It gets freed and re-used. By the time the iBFT table is accessed by
>>> ibft_init() the page has been over-written.
>>>
>>> Setting page flags like the following before the call to kmap()
>>> prevents the iBFT table page from being freed:
>>
>> Cleaned up version:
>>
>> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
>> index 0418feb..8f0a8e7 100644
>> --- a/drivers/acpi/osl.c
>> +++ b/drivers/acpi/osl.c
>> @@ -287,9 +287,12 @@ static void __iomem *acpi_map(acpi_physical_address pg_off, unsigned long pg_sz)
>>
>>        pfn = pg_off >> PAGE_SHIFT;
>>        if (should_use_kmap(pfn)) {
>> +        struct page *page = pfn_to_page(pfn);
>> +
>>            if (pg_sz > PAGE_SIZE)
>>                return NULL;
>> -        return (void __iomem __force *)kmap(pfn_to_page(pfn));
>> +        SetPageReserved(page);
>> +        return (void __iomem __force *)kmap(page);
>>        } else
>>            return acpi_os_ioremap(pg_off, pg_sz);
>>    }
>> @@ -299,9 +302,12 @@ static void acpi_unmap(acpi_physical_address pg_off, void __iomem *vaddr)
>>        unsigned long pfn;
>>
>>        pfn = pg_off >> PAGE_SHIFT;
>> -    if (should_use_kmap(pfn))
>> -        kunmap(pfn_to_page(pfn));
>> -    else
>> +    if (should_use_kmap(pfn)) {
>> +        struct page *page = pfn_to_page(pfn);
>> +
>> +        ClearPageReserved(page);
>> +        kunmap(page);
>> +    } else
>>            iounmap(vaddr);
>>    }
>>
>> David, the above works, but wondering why it is now necessary. kunmap()
>> is not hit. What other ways could a page mapped via kmap() be unmapped?
>>
>
> Let me look into the code ... I have little experience with ACPI 
> details, so bear with me.
>
> I assume that acpi_map()/acpi_unmap() map some firmware blob that is 
> provided via firmware/bios/... to us.
>
> should_use_kmap() tells us whether
> a) we have a "struct page" and should kmap() that one
> b) we don't have a "struct page" and should ioremap.
>
> As it is a blob, the firmware should always reserve that memory region 
> via memblock (e.g., memblock_reserve()), such that we either
> 1) don't create a memmap ("struct page") at all (-> case b) )
> 2) if we have to create a memmap, we mark the page PG_reserved and
>    *never* expose it to the buddy (-> case a) )
>
>
> Are you telling me that in this case we might have a memmap for the HW 
> blob that is *not* PG_reserved? In that case it most probably got 
> exposed to the buddy where it can happily get allocated/freed.
>
> The latent BUG would be that that blob gets exposed to the system like 
> ordinary RAM, and not reserved via memblock early during boot. 
> Assuming that blob has a low physical address, with my patch it will 
> get allocated/used a lot earlier - which would mean we trigger this 
> latent BUG now more easily.
>
> There have been similar latent BUGs on ARM boards that my patch 
> discovered where special RAM regions did not get marked as reserved 
> via the device tree properly.
>
> Now, this is just a wild guess :) Can you dump the page when mapping 
> (before SetPageReserved()) and when unmapping, to see what the state of 
> that memmap is?

Thank you David for the explanation and your help on this,

dump_page() output, taken before SetPageReserved() and before kmap() in the above patch:

[    1.116480] ACPI: Core revision 20201113
[    1.117628] XXX acpi_map: about to call kmap()...
[    1.118561] page:ffffea0002f914c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbe453
[    1.120381] flags: 0xfffffc0000000()
[    1.121116] raw: 000fffffc0000000 ffffea0002f914c8 ffffea0002f914c8 0000000000000000
[    1.122638] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[    1.124146] page dumped because: acpi_map pre SetPageReserved

I also added dump_page() before unmapping, but it is not hit. Instead, the 
following now shows up for the same pfn, I believe as a result of setting 
PageReserved:

[   28.098208] BUG: Bad page state in process modprobe  pfn:be453
[   28.098394] page:ffffea0002f914c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0xbe453
[   28.098394] flags: 0xfffffc0001000(reserved)
[   28.098394] raw: 000fffffc0001000 dead000000000100 dead000000000122 0000000000000000
[   28.098394] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
[   28.098394] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set
[   28.098394] page_owner info is not present (never set?)
[   28.098394] Modules linked in:
[   28.098394] CPU: 2 PID: 204 Comm: modprobe Not tainted 5.11.0-3dbd5e3 #66
[   28.098394] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[   28.098394] Call Trace:
[   28.098394]  dump_stack+0xdb/0x120
[   28.098394]  bad_page.cold.108+0xc6/0xcb
[   28.098394]  check_new_page_bad+0x47/0xa0
[   28.098394]  get_page_from_freelist+0x30cd/0x5730
[   28.098394]  ? __isolate_free_page+0x4f0/0x4f0
[   28.098394]  ? init_object+0x7e/0x90
[   28.098394]  __alloc_pages_nodemask+0x2d8/0x650
[   28.098394]  ? write_comp_data+0x2f/0x90
[   28.098394]  ? __alloc_pages_slowpath.constprop.103+0x2110/0x2110
[   28.098394]  ? __sanitizer_cov_trace_pc+0x21/0x50
[   28.098394]  alloc_pages_vma+0xe2/0x560
[   28.098394]  do_fault+0x194/0x12c0
[   28.098394]  ? write_comp_data+0x2f/0x90
[   28.098394]  __handle_mm_fault+0x1650/0x26c0
[   28.098394]  ? copy_page_range+0x1350/0x1350
[   28.098394]  ? write_comp_data+0x2f/0x90
[   28.098394]  ? write_comp_data+0x2f/0x90
[   28.098394]  handle_mm_fault+0x1f9/0x810
[   28.098394]  ? write_comp_data+0x2f/0x90
[   28.098394]  do_user_addr_fault+0x6f7/0xca0
[   28.098394]  exc_page_fault+0xaf/0x1a0
[   28.098394]  asm_exc_page_fault+0x1e/0x30
[   28.098394] RIP: 0010:__clear_user+0x30/0x60

What would be the correct way to reserve the page so that the above 
would not be hit?

BTW, this is running with Konrad's patch that pairs acpi_get_table() and 
acpi_put_table() for the iBFT table, which should result in an eventual 
call to acpi_unmap() and kunmap(), though that does not occur. This could 
be an ACPI page refcount issue that will have to be looked into.

George