[RFC] ARM64: 4 level page table translation for 4KB pages

Sungjin Chung sungjinn.chung at samsung.com
Mon Mar 31 20:44:36 EDT 2014


On Tuesday, April 01, 2014 12:27 AM Catalin Marinas wrote:
> On Mon, Mar 31, 2014 at 01:53:20PM +0100, Arnd Bergmann wrote:
> > On Monday 31 March 2014 12:31:14 Catalin Marinas wrote:
> > > On Mon, Mar 31, 2014 at 07:56:53AM +0100, Arnd Bergmann wrote:
> > > > On Monday 31 March 2014 12:51:07 Jungseok Lee wrote:
> > > > > The current ARM64 kernel cannot support 4KB pages for the 40-bit physical
> > > > > address space described in [1] due to one major issue and one minor issue.
> > > > >
> > > > > Firstly, the kernel logical memory map (0xffffffc000000000-0xffffffffffffffff)
> > > > > cannot cover the DRAM region from 544GB to 1024GB in [1]. Specifically, the
> > > > > ARM64 kernel fails to create a mapping for this region in the map_mem
> > > > > function (arch/arm64/mm/mmu.c) since __phys_to_virt overflows for this
> > > > > region. I've used 3.14-rc8 + Fast Models to validate this statement.
> > > >
> > > > It took me a while to understand what is going on, but it essentially comes
> > > > down to the logical memory map (0xffffffc000000000-0xffffffffffffffff)
> > > > being able to represent only RAM in the first 256GB of address space.
> > > >
> > > > More importantly, this means that any system following [1] will only be
> > > > able to use 32GB of RAM, which is a much more severe restriction than
> > > > what it sounds like at first.
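(For illustration, a minimal sketch of where the overflow happens, assuming
PHYS_OFFSET = 0x80000000, the base of the first DRAM block in [1]; the macro
follows arch/arm64/include/asm/memory.h as of 3.14:)

    #define PAGE_OFFSET     0xffffffc000000000UL    /* 256GB linear map */
    #define PHYS_OFFSET     0x0000000080000000UL    /* assumed DRAM base */

    #define __phys_to_virt(x) ((unsigned long)(x) - PHYS_OFFSET + PAGE_OFFSET)

    /*
     * Any physical address at or above PHYS_OFFSET + 256GB wraps past
     * 2^64, e.g. for the DRAM block starting at 544GB:
     *
     *   __phys_to_virt(0x8800000000UL)
     *     = 0x8800000000 - 0x80000000 + 0xffffffc000000000
     *     = 0x0000004780000000 (mod 2^64)
     *
     * which is far outside the linear map. Only the 32GB of DRAM below
     * the 512GB hole in [1] fits inside the mappable window.
     */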
> > >
> > > On a 64-bit platform, do we still need the alias at the bottom and the
> > > 512-544GB hole (even for 32-bit DMA, top address bits can be wired to
> > > 512GB)? Only the idmap would need 4 levels, but that's static, we don't
> > > need to switch Linux to 4-levels. Otherwise the memory is too sparse.
> >
> > I think we should keep a static virtual-to-physical mapping,
> 
> Just so that I understand: with a PHYS_OFFSET of 0?
> 
> > and to keep
> > relocating the kernel at compile time without a hack like ARM_PATCH_PHYS_VIRT
> > if at all possible.
> 
> and the kernel running at a virtual alias at a higher position than the
> end of the mapped RAM? IIUC x86_64 does something similar.
> 
> > > > > Secondly, the vmemmap space is not enough to cover more than about 585GB
> > > > > of physical address space. Fortunately, this issue can be resolved by
> > > > > utilizing the extra vmemmap space (0xffffffbe00000000-0xffffffbffbbfffff)
> > > > > in [2]. However, it would not cover systems with a couple of terabytes
> > > > > of DRAM.
> > > >
> > > > This one can be trivially changed by taking more space out of the vmalloc
> > > > area, to go much higher if necessary. vmemmap space is always just a fraction
> > > > of the linear mapping size, so we can accommodate it by definition if we
> > > > find space to fit the physical memory.
> > >
> > > vmemmap is the total range / page size * sizeof(struct page). So for 1TB
> > > range and 4K pages we would need 8GB (the current value, unless I
> > > miscalculated the above). Anyway, you can't cover 1TB range with
> > > 3-levels.
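(For reference, the arithmetic, assuming the common 64-byte struct page;
as Arnd notes below, some config options make it bigger:)

    /*
     * vmemmap_size = range / PAGE_SIZE * sizeof(struct page)
     *
     *   512GB range: 2^39 / 2^12 * 2^6 = 2^33 =  8GB   (the current
     *                                                   VA_BITS=39 value)
     *     1TB range: 2^40 / 2^12 * 2^6 = 2^34 = 16GB
     */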
> >
> > The size of 'struct page' depends on a couple of configuration variables.
> > If they are all enabled, you might need a bit more, even for configurations
> > that don't have that much address space.
> 
> Yes. We could make vmemmap configurable at run-time or just go for a
> maximum value.
> 
> > > > > Therefore, 4-level page table translation would need to be implemented
> > > > > for 4KB pages on platforms with a 40-bit physical address space. Someone
> > > > > might suggest using 64KB pages in this case, but I'm not sure how to deal
> > > > > with the internal memory fragmentation.
> > > > >
> > > > > I would like to contribute 4-level page table translation upstream,
> > > > > targeting the 3.16 kernel, if there is no movement on it. I saw some
> > > > > related RFC patches a couple of months ago, but they didn't seem to be
> > > > > merged into the maintainer's tree.
> > > >
> > > > I think you are answering the wrong question here. Four level page tables
> > > > should not be required to support >32GB of RAM, that would be very silly.
> > >
> > > I agree, we should only enable 4-levels of page table if we have close
> > > to 512GB of RAM or the range is too sparse but I would actually push
> > > back on the hardware guys to keep it tighter.
> >
> > But remember this part:
> >
> > > > There are good reasons to use a 50 bit virtual address space in user
> > > > land, e.g. for supporting data base applications that mmap huge files.
> >
> > You may actually need 4-level tables even if you have much less installed
> > memory, depending on how the application is written. Note that x86, powerpc
> > and s390 all chose to use 4-level tables for 64-bit kernels all the
> > time, even though they can also use 3-level or 5-level in some cases.
> 
> I don't mind 4-level tables by default but I would still keep a
> configuration option (or at least doing some benchmarks to assess the
> impact before switching permanently to 4-levels). There are mobile
> platforms that don't really need as much VA space (and people are even
> talking about ILP32).

Hi,

How about keeping the 3-level table by default and enabling the 4-level
table with a config option?
An asymmetric number of levels between kernel and userland would make the
code complicated. And usually, more memory means that user applications
tend to use more memory as well, so I suggest the same virtual address
space for both.
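A rough sketch of how such a build-time switch might look (the config symbol
and the exact constants here are hypothetical, not from an actual patch;
with 4KB pages, a 48-bit VA needs 4 levels and a 39-bit VA needs 3):

    /* Hypothetical Kconfig-driven choice of translation depth. */
    #ifdef CONFIG_ARM64_4_LEVELS
    #define VA_BITS         48      /* 4 levels: pgd -> pud -> pmd -> pte */
    #define PGDIR_SHIFT     39
    #else
    #define VA_BITS         39      /* 3 levels: pgd -> pmd -> pte */
    #define PGDIR_SHIFT     30
    #endif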

> 
> > > > If this is not the goal however, we should not pay for the overhead
> > > > of the extra page table in user space. I can see two other possible
> > > > solutions for the problem:
> > > >
> > > > a) always use a four-level page table in kernel space, regardless of
> > > > whether we do it in user space. We can move the kernel mappings down
> > > > in address space either by one 512GB entry to 0xffffff0000000000, or
> > > > to match the 64k-page location at 0xfffffc0000000000, or all the way
> > > > to 0xfffc000000000000. In any case, we can have all the dynamic
> > > > mappings within one 512GB area and pretend we have a three-level
> > > > page table for them, while the rest of DRAM is mapped statically at
> > > > early boot time using 512GB large pages.
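(As a worked example of the layout in (a): with a 4KB granule and 4 levels,
each top-level pgd entry spans 512GB:)

    #define PGDIR_SHIFT     39
    #define PGDIR_SIZE      (1UL << PGDIR_SHIFT)    /* 512GB per pgd entry */
    #define pgd_index(addr) (((addr) >> PGDIR_SHIFT) & 0x1ff)

    /*
     * All the proposed bases (0xffffff0000000000, 0xfffffc0000000000,
     * 0xfffc000000000000) start on a 512GB pgd boundary, so the dynamic
     * mappings (vmalloc, ioremap, vmemmap) can live inside a single pgd
     * entry and be managed as if the table were 3-level, while the linear
     * map of DRAM fills a few more pgd entries once at early boot.
     */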
> > >
> > > That's a workaround but we end up with two (or more) kernel pgds - one
> > > for vmalloc, ioremap etc. and another (static) one for the kernel linear
> > > mapping. So far there isn't any memory mapping carved out but we have to
> > > be careful in the future.
> > >
> > > However, kernel page table walking would be a bit slower with 4-levels.
> >
> > Do we actually walk the kernel page tables that often? With what I suggested,
> > we can still pretend that it's 3-level for all practical purposes, since
> > you wouldn't walk the page tables for the linear mapping.
> 
> I was referring to hardware page table walk (TLB miss). Again, we need
> some benchmarks (it gets worse in a guest as it needs to walk the stage
> 2 for each stage 1 level miss; if you are really unlucky you can have up
> to 24 memory accesses for a TLB miss with two translation stages and 4
> levels each).
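(For reference, the 24 figure is the standard two-dimensional walk cost:
every stage-1 table address, and the final output address, must itself be
translated by a full stage-2 walk:)

    /* Worst-case descriptor fetches for one TLB miss under virtualization:
     * s1 table reads, each preceded by a full s2 walk, plus a final s2
     * walk for the output address: s1*s2 + s1 + s2 = (s1+1)*(s2+1) - 1. */
    static inline unsigned int walk_cost(unsigned int s1, unsigned int s2)
    {
            return (s1 + 1) * (s2 + 1) - 1;     /* walk_cost(4, 4) == 24 */
    }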
> 
> --
> Catalin



