[PATCH] ARM: io: avoid writeback addressing modes for __raw_ accessors
Nicolas Pitre
nico at fluxnic.net
Mon Aug 20 09:29:31 EDT 2012
On Mon, 20 Aug 2012, Will Deacon wrote:
> Hi Nicolas,
>
> [apologies in advance for the long reply]
>
> On Fri, Aug 17, 2012 at 04:43:01AM +0100, Nicolas Pitre wrote:
> > On Tue, 14 Aug 2012, Will Deacon wrote:
> >
> > > Data aborts taken to hyp mode do not provide a valid instruction
> > > syndrome field in the HSR if the faulting instruction is a memory
> > > access using a writeback addressing mode.
> > >
> > > For hypervisors emulating MMIO accesses to virtual peripherals, taking
> > > such an exception requires disassembling the faulting instruction in
> > > order to determine the behaviour of the access. Since this requires
> > > manually walking the two stages of translation, the world must be
> > > stopped to prevent races against page aging in the guest, where the
> > > first-stage translation is invalidated after the hypervisor has
> > > translated to an IPA and the physical page is reused for something else.
> > >
> > > This patch avoids taking this heavy performance penalty when running
> > > Linux as a guest by ensuring that our I/O accessors do not make use of
> > > writeback addressing modes.
> >
> > How often does this happen? I don't really see writeback as a common
> > pattern for IO access.
>
> Building a Thumb-2 kernel with GCC:
>
> gcc version 4.6.3 20120201 (prerelease) (crosstool-NG
> linaro-1.13.1-2012.02-20120222 - Linaro GCC 2012.02)
>
> Translates the following code from amba_device_add:
>
> 	for (pid = 0, i = 0; i < 4; i++)
> 		pid |= (readl(tmp + size - 0x20 + 4 * i) & 255) <<
> 			(i * 8);
> 	for (cid = 0, i = 0; i < 4; i++)
> 		cid |= (readl(tmp + size - 0x10 + 4 * i) & 255) <<
> 			(i * 8);
>
> into:
[...]
OK, I can see how the compiler will fold the loop increment into the IO
access instruction. But my point is: is this common? And when this
happens, is this a critical path?
> > What does happen quite a lot, though, is pre-indexed addressing. For
> > example, let's take this code which is fairly typical of driver code:
> >
> > #define HW_REG1 0x10
> > #define HW_REG2 0x14
> > #define HW_REG3 0x18
> > #define HW_REG4 0x30
> >
> > int hw_init(void __iomem *ioaddr)
> > {
> > 	writel(0, ioaddr + HW_REG1);
> > 	writel(-1, ioaddr + HW_REG2);
> > 	writel(readl(ioaddr + HW_REG3) | 0xff, ioaddr + HW_REG4);
> > 	return 0;
> > }
> >
> > Right now this produces this:
> >
> > hw_init:
> > 	mov	r3, r0
> > 	mvn	r2, #0
> > 	mov	r0, #0
> > 	str	r0, [r3, #16]
> > 	str	r2, [r3, #20]
> > 	ldr	r2, [r3, #24]
> > 	orr	r2, r2, #255
> > 	str	r2, [r3, #48]
> > 	bx	lr
>
> Well, that's not quite true for CONFIG_ARM_DMA_MEM_BUFFERABLE=y.
True. I turned those readl() into their raw counterparts to generate
the assembly but didn't update the example code in my mailer.
> whilst the addressing modes are still nice, the dsb is going to be the
> performance limitation here. That said, we could try the "Qo" constraints,
> which I seem to remember don't generate writebacks. I'll have a play.
OK. That would be excellent.
> > So, is the performance claim something that you've actually measured
> > with a real system, or was it only theoretical?
>
> The difference is down to the work done by the hypervisor: an MMIO access
> will trap to hyp mode, where the HSR describes the instruction (access size,
> load/store, Rt, signedness, etc.). For a writeback instruction, this
> information is not provided by the hardware. Instead, the hypervisor has to
> disassemble the faulting instruction and work out what it's doing at which
> address prior to emulation.
>
> Even if that cost was acceptable, the problem then gets worse. Imagine that
> a guest MMIO access faults into the hypervisor, where the emulation code
> tries to decode the instruction because the fault information is incomplete.
> To do this, it must obtain the *physical* address of the faulting text page
> so that it can load the instruction. This happens via the ATS12NSOP{R,W}
> registers, which return a PA for the faulting VA (i.e. both stages of
> translation).
>
> Now, let's say the hypervisor has got hold of a PA but hasn't yet loaded the
> instruction. Meanwhile, another virtual CPU running the same guest decides
> (due to page aging or whatnot) to reclaim the text page containing the
> faulting instruction. It writes a faulting pte and does a TLB invalidation,
> however this is too late for the hypervisor, who has already translated its
> address. Furthermore, let's say that the guest then reuses the same physical
> page for something like a network buffer. The hypervisor goes ahead and grabs
> what it thinks is the faulting instruction from memory but in fact gets a
> load of random network data!
>
> To deal with this, the hypervisor will likely have to stop the virtual world
> when emulating any MMIO accesses that report incomplete fault information to
> avoid racing with a TLB invalidation from another virtual CPU. That will
> certainly be more expensive than an additional instruction on each access.
I totally agree with you here.
However, for completeness and above all for security reasons, the
hypervisor will _have_ to support that case anyway.
So it is now a matter of trading off performance against code size.
If the pathological case you brought up above is the exception rather
than the rule, then I think we can live with the performance impact in
that case and keep the optimal pre-indexed addressing for the common
cases.
Nicolas