Re: [RFC PATCH v2 00/21] riscv: Introduce 64K base page

Fri Dec 6 10:48:52 PST 2024

On Fri, Dec 6, 2024 at 1:42 PM Xu Lu <luxu.kernel at bytedance.com> wrote:
>
> Hi David,
>
> On Fri, Dec 6, 2024 at 6:13 PM David Hildenbrand <david at redhat.com> wrote:
> >
> > On 06.12.24 03:00, Zi Yan wrote:
> > > On 5 Dec 2024, at 5:37, Xu Lu wrote:
> > >
> > > >> This patch series attempts to work around the limitations of the MMU
> > > >> and support a larger base page on RISC-V, which currently supports
> > > >> only a 4K page size. The key idea is to always manage and allocate
> > > >> memory at a granularity of 64K and use SVNAPOT to accelerate address
> > > >> translation.
> > >> This is the second version and the detailed introduction can be found
> > >> in [1].
> > >>
> > >> Changes from v1:
> > >> - Rebase on v6.12.
> > >>
> > >> - Adjust the page table entry shift to reduce page table memory usage.
> > >>      For example, in SV39, the traditional va behaves as:
> > >>
> > >>      ----------------------------------------------
> > >>      | pgd index | pmd index | pte index | offset |
> > >>      ----------------------------------------------
> > >>      | 38     30 | 29     21 | 20     12 | 11   0 |
> > >>      ----------------------------------------------
> > >>
> > >>      When we choose 64K as basic software page, va now behaves as:
> > >>
> > >>      ----------------------------------------------
> > >>      | pgd index | pmd index | pte index | offset |
> > >>      ----------------------------------------------
> > >>      | 38     34 | 33     25 | 24     16 | 15   0 |
> > >>      ----------------------------------------------
> > >>
> > >> - Fix some bugs in v1.
> > >>
> > >> Thanks in advance for comments.
> > >>
> > >> [1] https://lwn.net/Articles/952722/
> > >
> > > This looks very interesting. Can you cc me and linux-mm at kvack.org
> > > in the future? Thanks.
> > >
> > > Have you thought about doing it for ARM64 4KB as well? ARM64’s contig PTE
> > > should have similar effect of RISC-V’s SVNAPOT, right?
> >
> > What is the real benefit over 4k + large folios/mTHP?
> >
> > 64K comes with the problem of internal fragmentation: for example, a
> > page table that only occupies 4k of memory suddenly consumes 64K; quite
> > a downside.
>
> The original idea comes from the performance benefits we achieved on
> the ARM 64K kernel. We ran several real-world applications on the ARM
> Ampere Altra platform and found that their performance on the 64K page
> kernel is significantly higher than on the 4K page kernel:
> For Redis, throughput increased by 250% and latency decreased by 70%.
> For MySQL, throughput increased by 16.9% and latency decreased by
> 14.5%.
> For our own NewSQL database, throughput increased by 16.5% and latency
> decreased by 13.8%.
>
> Also, we have compared the performance between 64K and 4K + large
> folios/mTHP on ARM Neoverse-N2. The results show a considerable
> performance improvement on the 64K kernel for both speccpu and lmbench,
> even when the 4K kernel enables THP and ARM64_CONTPTE:
> For the speccpu benchmark, the 64K kernel without any huge page
> optimization still achieves a 4.17% higher score than the 4K kernel with
> transparent huge pages as well as the CONTPTE optimization.
> For lmbench, the 64K kernel achieves 75.98% lower memory mapping
> latency (16MB) than the 4K kernel with transparent huge pages and the
> CONTPTE optimization, 84.34% higher mmap read open2close bandwidth
> (16MB), and 10.71% lower random load latency (16MB).
> Interestingly, kernels with transparent huge page support sometimes
> perform worse, for both 4K and 64K (for example, in the mmap read
> bandwidth bench). We assume this is due to the overhead of combining
> and collapsing huge pages.
> Also, if you check the full results, you will find that the larger the
> memory size used for testing, the better the 64K kernel usually performs
> relative to the 4K kernel, unless the memory size lies in a range where
> the 4K kernel can apply 2MB huge pages while the 64K kernel can't.
> In summary, for performance-sensitive applications that require
> higher bandwidth and lower latency, 4K pages with huge pages may
> sometimes not be the best choice, and 64K pages can achieve better
> results. The test environment and results are attached.
>
> As RISC-V has no native 64K MMU support, we introduce a software
> implementation and accelerate it via Svnapot. Of course, there will be
> some extra overhead compared with a native 64K MMU. Thus, we are also
> trying to persuade the RISC-V community to support an extension for a
> native 64K MMU [1]. Please join us if you are interested.
>

Ok, so you... didn't test this on riscv? And you're basing this
patchset off of a native 64KiB page size kernel being faster than 4KiB
+ CONTPTE? I don't see how that makes sense?

/me is confused

How many of these PAGE_SIZE wins are related to e.g. userspace basing
its buffer sizes (or whatever) off of the system page size? Where
exactly are you gaining time versus the CONTPTE stuff?
I think MM in general would be better off if we were more transparent
with regard to CONTPTE and page sizes instead of hand waving with
"hardware page size != software page size", which is such a *checks
notes* 4.4BSD idea... :) At the very least, this patchset seems to go
against all the work on better supporting large folios and CONTPTE.
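
To make the userspace question above concrete, here is a minimal sketch of
the kind of pattern I mean, where a program scales a buffer by the reported
page size. The helper names buf_len_for() and alloc_page_buf() are
hypothetical and not taken from any of the benchmarked applications; only
sysconf(_SC_PAGESIZE) and posix_memalign() are real POSIX calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: buffer length expressed as a number of pages. */
static size_t buf_len_for(long page_size, size_t npages)
{
    assert(page_size > 0);
    return (size_t)page_size * npages;
}

/*
 * Hypothetical helper: allocate a page-aligned buffer that spans
 * npages system pages. Returns NULL on failure; on success the
 * buffer length is stored in *len_out and the caller must free().
 */
static void *alloc_page_buf(size_t npages, size_t *len_out)
{
    long page_size = sysconf(_SC_PAGESIZE);
    void *buf = NULL;

    if (page_size <= 0)
        return NULL;

    *len_out = buf_len_for(page_size, npages);
    if (posix_memalign(&buf, (size_t)page_size, *len_out) != 0)
        return NULL;

    return buf;
}
```

With npages = 16, the same binary gets a 64KiB buffer on a 4K kernel and a
1MiB buffer on a 64K kernel, so part of any measured difference could come
from the application simply doing larger I/O rather than from translation
costs.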

-- 
Pedro