USB mass storage and ARM cache coherency

Benjamin Herrenschmidt benh at kernel.crashing.org
Thu Mar 4 23:44:55 EST 2010


> Basically we have two different MMUs on VIPT parts, the older one on all
> SH-4 parts were all read-implies-exec with no ability to differentiate
> between read or exec access. 

Ok, this is the same as the older ppc32 processors.

> For these parts the PG_dcache_dirty approach
> saves us from a lot of flushing, and the corner cases were isolated
> enough that we could tolerate fixups at the driver level, even on a
> write-allocate D-cache.

But how wide a range of devices do you have to support with those ? Is
this a few SoCs or people putting any random PCI device in there for
example ?

If I were to do it that way on ppc32, I worry that it would be more
than a few drivers that I would have to fix :-) All the 32-bit PowerMacs
and PowerBooks for example, all of the Freescale 74xx based parts, etc...
those guys have PCI, and all sorts of random HW plugged into them.

I would -love- to avoid that horrible amount of flushing we do on these,
it's quite high on any profile run, but I haven't found a good way to do
so. There's also a nasty issue of icache content leaking between
processes which I doubt is exploitable but I had people having a go at
me about it when I tried to avoid icache cleaning anonymous pages by
default.

> For second generation SH-4A (SH-X2) and up parts, read and exec are split
> out and we could reasonably adopt the PG_dcache_clean approach there
> while adopting the same sort of flushing semantics as PPC to avoid
> flushing constantly. The current generation of parts far outnumber their
> legacy counterparts, so it's certainly something I plan to experiment
> with.

I'd be curious to see whether you get a perf improvement with that.

Note that we still have this additional thing that is floating around in
this thread which I think is definitely worthwhile to do, which is to
mark pages that have been written to with DMA as clean in dma_unmap and
friends... if we can fix the icache problem. So far, I haven't found
James' replies on this satisfactory :-) But maybe I just missed
something.

> We have an additional level of complexity on some of the SMP parts with a
> non-coherent I-cache,

I've seen that on some embedded ppc's too, where the icache flush
instructions aren't broadcast, like ARM11MP in fact. Pretty horrible.
Fortunately, so far nobody sane (apart from Bluegene) has done an SMP part
with those, and so we have well localized internal hacks for them. But
I've heard that some vendors might be pumping out SoCs with that stuff
too soon, which worries me.

>  some of the early CPUs have broken broadcasting of
> the cacheops in hardware and so need to rely on IPIs, while the later
> parts broadcast properly. We also need to deal with D-cache IPIs when
> using mixed coherency protocols on different CPUs.

Right, that sucks. Do those have no-exec permission support ? If they
do, then you can do what I did for BG, which is to ping pong user pages
so they are either writable or executable (since userspace code itself
will break as it will assume the cache ops -are- broadcast, since that's
what the architecture says).
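
To make the ping-pong idea concrete, here is a minimal user-space sketch in
C. It is a simulation, not kernel code: the names (sim_upage, sim_fault,
sim_touch) are hypothetical, and the assumption that the kernel does the
cache sync on every CPU at the write-to-exec transition is mine, not
something taken from the BG implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Kinds of user access we model (hypothetical names). */
enum sim_access { SIM_WRITE, SIM_EXEC };

/* One user page whose permissions ping-pong between W and X. */
struct sim_upage {
    bool writable;
    bool executable;
    unsigned int cache_syncs; /* D-flush + I-invalidate done by the "kernel" */
};

/* Fault handler: grant the requested permission, revoke the other one. */
static void sim_fault(struct sim_upage *pg, enum sim_access access)
{
    if (access == SIM_WRITE) {
        pg->executable = false;
        pg->writable = true;
    } else {
        /*
         * Assumption: the kernel syncs the caches on every CPU here,
         * since userspace cache ops cannot be trusted to be broadcast.
         */
        pg->writable = false;
        pg->cache_syncs++;
        pg->executable = true;
    }
    /* Invariant of the scheme: never writable and executable at once. */
    assert(!(pg->writable && pg->executable));
}

/* Simulate one user access; returns true if it had to take a fault. */
static bool sim_touch(struct sim_upage *pg, enum sim_access access)
{
    bool allowed = (access == SIM_WRITE) ? pg->writable : pg->executable;

    if (!allowed)
        sim_fault(pg, access);
    return !allowed;
}
```

The cost is one fault per direction change, so this only pays off if pages
rarely flip between being written and being executed.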

> For older PIPT parts we've never used the deferred flush, since the only
> time we ever had to bother with cache maintenance was in the DMA ops, as
> anything closer to the CPU than the PCI DMAC had no opportunity to be
> snooped.

Do you also, like ARM11MP, have a case of non-cache coherent DMA and
non-broadcast cache ops in SMP ? That's somewhat of a killer, I still
don't see how it can be dealt with properly other than using load/store
tricks to bring the data into the local cache and flushing it from
there. DMA ops are called way too deep into spinlock hell to rely on IPIs
(unless your HW also provides some kind of NMI IPIs).

> > > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > > more aggressive. For the DMA devices, Russell suggested that we mark
> > > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > > default flushing.
> > 
> > I really like that idea, as I said earlier, but I'm worried about the I$
> > side of things. IE. What I'm trying to say is that I can't see how to do
> > that optimisation without ending up with missing I$ invalidations or
> > doing way too many of them, unless we have a separate bit to track I$
> > state.
> > 
> Using PG_dcache_clean from the DMA API sounds like a pretty good idea,
> and certainly worth experimenting with. I don't know how we would do the
> I-cache optimization without a PG_arch_2, though.

Right. That's the one thing I've been trying to figure out without
success. But then, is it a big deal to add PG_arch_2 ? Doesn't sound
like it to me...
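
A rough sketch of why a second bit helps, as a user-space simulation in C
rather than kernel code. Everything here is hypothetical: the flag names
only mimic the roles of PG_dcache_clean and a would-be PG_arch_2, and the
sim_* functions stand in for the DMA unmap path, a CPU write through the
kernel mapping, and the point where a page gets mapped into userspace.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-page state bits, tracked independently. */
enum {
    SIM_DCACHE_CLEAN = 1u << 0, /* no dirty D-cache lines for the page */
    SIM_ICACHE_CLEAN = 1u << 1, /* no stale I-cache lines (the "PG_arch_2" role) */
};

struct sim_page {
    unsigned int flags;
    unsigned int dcache_flushes;     /* how many D-cache flushes we paid for */
    unsigned int icache_invalidates; /* how many I-cache invalidates we paid for */
};

/*
 * DMA from a device just completed: the DMA API already dealt with the
 * D-cache, so the page is D-clean, but the I-cache may hold old contents.
 */
static void sim_dma_unmap_from_device(struct sim_page *pg)
{
    pg->flags |= SIM_DCACHE_CLEAN;
    pg->flags &= ~SIM_ICACHE_CLEAN;
}

/* The CPU wrote to the page: both caches are now suspect. */
static void sim_cpu_write(struct sim_page *pg)
{
    pg->flags &= ~(SIM_DCACHE_CLEAN | SIM_ICACHE_CLEAN);
}

/* Map the page into userspace, doing only the maintenance that is needed. */
static void sim_map_user(struct sim_page *pg, bool exec)
{
    if (!(pg->flags & SIM_DCACHE_CLEAN)) {
        pg->dcache_flushes++;
        pg->flags |= SIM_DCACHE_CLEAN;
    }
    if (exec && !(pg->flags & SIM_ICACHE_CLEAN)) {
        /* The I-cache must refetch from memory that is already up to date. */
        assert(pg->flags & SIM_DCACHE_CLEAN);
        pg->icache_invalidates++;
        pg->flags |= SIM_ICACHE_CLEAN;
    }
}
```

With a single bit, the DMA path would either have to leave the page looking
dirty (extra D-cache flushes) or mark it clean and lose the information
that an I-cache invalidate is still owed on the first exec mapping.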

> In any event, if there's going to be a mass exodus to PG_dcache_clean,
> Documentation/cachetlb.txt could use a considerable amount of expanding.
> The read/exec and I-cache optimizations are something that would be
> valuable to document, as opposed to simply being pointed at the sparc64
> approach with the regular PG_dcache_dirty caveats.

Cheers,
Ben.
