[RFC] ARM64: 4 level page table translation for 4KB pages

Arnd Bergmann arnd at arndb.de
Mon Mar 31 19:11:34 EDT 2014


On Monday 31 March 2014 16:27:19 Catalin Marinas wrote:
> On Mon, Mar 31, 2014 at 01:53:20PM +0100, Arnd Bergmann wrote:
> > On Monday 31 March 2014 12:31:14 Catalin Marinas wrote:
> > > On Mon, Mar 31, 2014 at 07:56:53AM +0100, Arnd Bergmann wrote:
> > > > On Monday 31 March 2014 12:51:07 Jungseok Lee wrote:
> > > > > The current ARM64 kernel cannot support 4KB pages for the 40-bit physical
> > > > > address space described in [1], due to one major and one minor issue.
> > > > > 
> > > > > Firstly, the kernel logical memory map (0xffffffc000000000-0xffffffffffffffff)
> > > > > cannot cover the DRAM region from 544GB to 1024GB in [1]. Specifically, the
> > > > > ARM64 kernel fails to create a mapping for this region in the map_mem function
> > > > > (arch/arm64/mm/mmu.c), since __phys_to_virt overflows for these addresses.
> > > > > I've used 3.14-rc8 + Fast Models to validate this.
> > > > 
> > > > It took me a while to understand what is going on, but it essentially
> > > > comes down to the logical memory map (0xffffffc000000000-0xffffffffffffffff)
> > > > only being able to represent RAM in the first 256GB of the physical
> > > > address space.
> > > > 
> > > > More importantly, this means that any system following [1] will only be
> > > > able to use 32GB of RAM, which is a much more severe restriction than
> > > > it sounds at first.
> > > 
> > > On a 64-bit platform, do we still need the alias at the bottom and the
> > > 512-544GB hole (even for 32-bit DMA, the top address bits can be wired to
> > > 512GB)? Only the idmap would need 4 levels, but that's static; we don't
> > > need to switch Linux to 4 levels. Otherwise the memory is too sparse.
> > 
> > I think we should keep a static virtual-to-physical mapping,
> 
> Just so that I understand: with a PHYS_OFFSET of 0?

I hadn't realized at first that it's variable, but I guess 0 would be the easiest;
otherwise we wouldn't be able to use 512GB pages to map the high memory range.
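
To spell out the overflow for anyone following along, here is a minimal
sketch of the arithmetic, simplified from the macros in
arch/arm64/include/asm/memory.h, with PHYS_OFFSET taken as 0 for
illustration:

	/* Simplified __phys_to_virt for the 39-bit VA configuration. */
	#define PAGE_OFFSET	0xffffffc000000000UL
	#define PHYS_OFFSET	0x0UL

	static inline unsigned long phys_to_virt_sketch(unsigned long x)
	{
		return x - PHYS_OFFSET + PAGE_OFFSET;
	}

	/*
	 * The linear map is only 256GB (0x4000000000 bytes) tall, so any
	 * physical address at PHYS_OFFSET + 256GB or above wraps past
	 * 2^64, e.g.:
	 *
	 *   phys_to_virt_sketch(544UL << 30) == 0x0000004800000000
	 *
	 * which is not a kernel virtual address at all.
	 */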

> > and to keep
> > relocating the kernel at compile time without a hack like ARM_PATCH_PHYS_VIRT
> > if at all possible.
> 
> and the kernel running at a virtual alias at a higher position than the
> end of the mapped RAM? IIUC x86_64 does something similar.

That would work, yes.

Another idea is to always run the kernel at PAGE_OFFSET, as today, but to
create an alias there if, with the fixed PHYS_OFFSET, there isn't already
RAM at that location.

> > > > > Secondly, the vmemmap space is not enough to cover more than about 585GB
> > > > > of physical address space. Fortunately, this can be resolved by utilizing
> > > > > the extra vmemmap space (0xffffffbe00000000-0xffffffbffbbfffff) in [2].
> > > > > However, that would still not cover systems with a couple of terabytes of DRAM.
> > > > 
> > > > This one can be trivially fixed by taking more space out of the vmalloc
> > > > area, to go much higher if necessary. vmemmap space is always just a fraction
> > > > of the linear mapping size, so we can accommodate it by definition if we
> > > > find space to fit the physical memory.
> > > 
> > > vmemmap is the total range / page size * sizeof(struct page). So for a 1TB
> > > range and 4K pages we would need 8GB (the current value, unless I
> > > miscalculated the above). Anyway, you can't cover a 1TB range with
> > > 3 levels.
> > 
> > The size of 'struct page' depends on a couple of configuration variables.
> > If they are all enabled, you might need a bit more, even for configurations
> > that don't have that much address space.
> 
> Yes. We could make vmemmap configurable at run-time or just go for a
> maximum value.

I would just aim for 'large enough': pick a reasonable maximum RAM size
and then leave space for four times as much mem_map as we need.
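
To put numbers on that, here is a back-of-the-envelope sketch; the 1TB
ceiling and the 64-byte struct page are assumptions, since both are
configuration-dependent:

	#define MAX_PHYSMEM		(1UL << 40)	/* assumed 1TB ceiling */
	#define PAGE_SIZE_4K		(1UL << 12)
	#define STRUCT_PAGE_SIZE	64UL		/* varies with config */

	/* (2^40 / 2^12) * 64 == 16GB of mem_map for a fully populated 1TB */
	#define MEM_MAP_SIZE	(MAX_PHYSMEM / PAGE_SIZE_4K * STRUCT_PAGE_SIZE)

Reserving four times that, i.e. 64GB of VA space, still leaves plenty of
headroom for a larger struct page or more RAM than we expect today.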

> > > > > Therefore, 4-level page table translation would need to be implemented
> > > > > for 4KB pages on platforms with a 40-bit physical address space. Someone
> > > > > might suggest using 64KB pages in this case, but I'm not sure how to
> > > > > deal with the internal memory fragmentation.
> > > > > 
> > > > > I would like to contribute 4-level page table translation upstream,
> > > > > targeting the 3.16 kernel, if there is no other movement on it. I saw
> > > > > some related RFC patches a couple of months ago, but they don't seem
> > > > > to have been merged into the maintainer's tree.
> > > > 
> > > > I think you are answering the wrong question here. Four-level page tables
> > > > should not be required to support >32GB of RAM; that would be very silly.
> > > 
> > > I agree, we should only enable 4 levels of page table if we have close
> > > to 512GB of RAM or the range is too sparse, but I would actually push
> > > back on the hardware guys to keep it tighter.
> > 
> > But remember this part:
> > 
> > > > There are good reasons to use a 50 bit virtual address space in user
> > > > land, e.g. for supporting data base applications that mmap huge files.
> > 
> > You may actually need 4-level tables even if you have much less installed
> > memory, depending on how the application is written. Note that x86, powerpc
> > and s390 all chose to use 4-level tables for 64-bit kernels all the
> > time, even though they could also use 3-level or 5-level in some cases.
> 
> I don't mind 4-level tables by default but I would still keep a
> configuration option (or at least do some benchmarks to assess the
> impact before switching permanently to 4 levels). There are mobile
> platforms that don't really need as much VA space (and people are even
> talking about ILP32).

Yes, I wasn't suggesting we do it all the time. A related question
is whether we would also want to support 3-level 64k page tables, to
extend the addressable area from 42 bits (4TB) to 55 bits (large enough).
Is that actually a supported configuration?
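
For reference, the arithmetic behind those numbers: with a 64KB granule
and 8-byte descriptors, each table level resolves 13 bits on top of the
16-bit page offset (the macro name here is made up):

	/* 64KB / 8 bytes = 8192 entries = 13 bits per table level */
	#define VA_BITS_64K(levels)	((levels) * 13 + 16)
	/* VA_BITS_64K(2) == 42 (4TB), VA_BITS_64K(3) == 55 */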

> > > > If this is not the goal, however, we should not pay for the overhead
> > > > of the extra page table in user space. I can see two other possible
> > > > solutions to the problem:
> > > > 
> > > > a) always use a four-level page table in kernel space, regardless of
> > > > whether we do it in user space. We can move the kernel mappings down
> > > > in address space either by one 512GB entry to 0xffffff0000000000, or
> > > > to match the 64k-page location at 0xfffffc0000000000, or all the way
> > > > to 0xfffc000000000000. In any case, we can have all the dynamic
> > > > mappings within one 512GB area and pretend we have a three-level
> > > > page table for them, while the rest of DRAM is mapped statically at
> > > > early boot time using 512GB large pages.
> > > 
> > > That's a workaround, but we end up with two (or more) kernel pgds: one
> > > for vmalloc, ioremap etc. and another (static) one for the kernel linear
> > > mapping. So far there isn't any memory mapping carved out, but we have to
> > > be careful in the future.
> > > 
> > > However, kernel page table walking would be a bit slower with 4 levels.
> > 
> > Do we actually walk the kernel page tables that often? With what I suggested,
> > we can still pretend that it's 3-level for all practical purposes, since
> > you wouldn't walk the page tables for the linear mapping.
> 
> I was referring to hardware page table walk (TLB miss). Again, we need
> some benchmarks (it gets worse in a guest as it needs to walk the stage
> 2 for each stage 1 level miss; if you are really unlucky you can have up
> to 24 memory accesses for a TLB miss with two translation stages and 4
> levels each).

Ah right, of course: each of the four stage-1 levels plus the final output
address needs its own four-level stage-2 walk, so 5 * 4 stage-2 accesses
plus the 4 stage-1 descriptor reads gives the 24. It would only matter for
MMIO mappings though, as the linear mapping can be done using 1GB or 512GB
large pages, and those tend not to cause noticeable overhead during lookup.
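
For the record, here is roughly the shape of what I proposed in (a) above.
This is only a sketch with made-up names (TABLE_DESC_FLAGS, BLOCK_DESC_FLAGS,
setup_kernel_pgd), and whether the hardware allows block entries at the top
level is exactly the open question:

	/*
	 * One pgd slot holds the table for all dynamic mappings (vmalloc,
	 * ioremap, vmemmap), so runtime code can keep treating the pud
	 * underneath it as a 3-level root; the other slots map DRAM
	 * statically at early boot.
	 */
	#define PTRS_PER_TABLE	512
	#define PGDIR_SHIFT	39	/* 512GB per pgd entry */

	static u64 kernel_pgd[PTRS_PER_TABLE] __aligned(PAGE_SIZE);
	static u64 dynamic_pud[PTRS_PER_TABLE] __aligned(PAGE_SIZE);

	static void __init setup_kernel_pgd(phys_addr_t dram_base, int nr)
	{
		int i;

		/* everything under this entry behaves like today's 3 levels */
		kernel_pgd[0] = __pa(dynamic_pud) | TABLE_DESC_FLAGS;

		/* static linear map; if top-level blocks turn out not to be
		 * allowed, this becomes a loop over 1GB blocks one level down */
		for (i = 0; i < nr; i++)
			kernel_pgd[1 + i] = (dram_base + ((u64)i << PGDIR_SHIFT))
					    | BLOCK_DESC_FLAGS;
	}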

	Arnd


