[PATCH v4] mtd: nand: sunxi: fix OOB handling in ->write_xxx() functions

Mon Sep 14 04:49:12 PDT 2015

On Monday 14 September 2015 11:41:13 Boris Brezillon wrote:
> Hi Arnd,
> 
> On Mon, 14 Sep 2015 10:59:02 +0200
> Arnd Bergmann <arnd at arndb.de> wrote:
> 
> > On Monday 14 September 2015 10:41:03 Boris Brezillon wrote:
> > >                 /* Fill OOB data in */
> > > -               if (oob_required) {
> > > -                       tmp = 0xffffffff;
> > > -                       memcpy_toio(nfc->regs + NFC_REG_USER_DATA_BASE, &tmp,
> > > -                                   4);
> > > -               } else {
> > > -                       memcpy_toio(nfc->regs + NFC_REG_USER_DATA_BASE,
> > > -                                   chip->oob_poi + offset - mtd->writesize,
> > > -                                   4);
> > > -               }
> > > +               writel(NFC_BUF_TO_USER_DATA(chip->oob_poi +
> > > +                                           layout->oobfree[i].offset),
> > > +                      nfc->regs + NFC_REG_USER_DATA_BASE);
> > 
> > This looks like you are changing the endianess of the data that gets written.
> > Is that intentional?
> 
> Hm, the real goal of this patch was to avoid accessing the
> NFC_REG_USER_DATA_BASE register using byte accessors (writeb()).
> The first version of this series was directly copying data from the
> buffer into a temporary u32 variable, thus forcing the data to be stored
> in little endian (tell me if I'm wrong), and then changing endianness
> using le32_to_cpu().
> Brian suggested to use __raw_writel() (as you seem to suggest too), but
> I was worried about the missing mem barrier in this function.
> That's why I made my own macro doing the little endian to CPU conversion
> manually, but still using the standard writel() accessor (which will
> do the conversion in reverse order).

D'oh, I totally missed your open-coded le32_to_cpu macro.
So your code does look correct to me, it's just a little inefficient
on big-endian machines because you end up swapping twice.

> Maybe I should use __raw_writel() and add an explicit memory barrier.

That would work, and avoid the double swapping, yes. Or you could
use writesl() with a length of one, which should have all the
necessary barriers but no byteswap.

I don't think you need a barrier on ARM here (no DMA that can interfere),
but it's better write the code architecture independent (as you did
above).

The memcpy_toio() is definitely overkill as it has a barrier after every
single byte, where you need at most one for this kind of driver.

> > memcpy_toio() uses the same endianess for source and destination, while writel()
> > assumes that the destination is a little-endian register, and that could break
> > if the kernel is built to run as big-endian. I also see that sunxi_nfc_write_buf()
> > uses memcpy_toio() for writing the actual data, and you are not changing that.
> 
> AFAIU the peripheral is always in little endian, and only the CPU can
> be switched to big endian, right?

Correct.

> Are you saying that memcpy_toio() uses writel? Because according to
> this implementation [2] it uses writeb, which should be safe (accessing
> the internal SRAM using byte accessors is authorized).

No, what I meant is that you are replacing some memcpy_toio() with
writel(), but don't replace some of the others that should/could also
be replaced.

As mentioned, I was under the impression that you changed the endianess
for the OOB data but did not change the endianess for the user data,
which would be inconsistent.

> > If all hardware can do 32-bit accesses here and the size is guaranteed to be a
> > multiple of four bytes, you can probably improve performance by using a
> > __raw_writel() loop there. Using __raw_writel() in general is almost always
> > a bug, but here it actually makes sense. See also the powerpc implementation
> > of _memcpy_toio().
> 
> AFAICT, buffer passed to ->write_bu() are not necessarily aligned on
> 32bits, so using writel here might require copying data in temporary
> buffers :-/.
> 
> Don't hesitate to point where I'm wrong ;-).

Brian or Dwmw2 should be able to know for sure. I think it's definitely
worth trying as the potential performance gains could be huge, if you
replace

	for (p = start; p < start + length; data++, p++) {
		writeb(*data, p);
		wmb();
	}

with

	for (p = start; p < start + length; data++, p+=4) {
		writel(*data, p);
	};
	wmb();

	Arnd