[EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors

Frank Li frank.li at nxp.com
Mon Jun 21 09:11:57 PDT 2021



> -----Original Message-----
> From: Will Deacon <will at kernel.org>
> Sent: Thursday, June 17, 2021 4:40 PM
> To: Frank Li <frank.li at nxp.com>
> Cc: Catalin Marinas <catalin.marinas at arm.com>; Zhi Li <lznuaa at gmail.com>;
> Shenwei Wang <shenwei.wang at nxp.com>; Han Xu <han.xu at nxp.com>; Nitin Garg
> <nitin.garg at nxp.com>; Jason Liu <jason.hui.liu at nxp.com>;
> linux-arm-kernel at lists.infradead.org
> Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers
> in default I/O accessors
> 
> On Thu, Jun 17, 2021 at 08:11:50PM +0000, Frank Li wrote:
> >
> >
> > > -----Original Message-----
> > > From: Will Deacon <will at kernel.org>
> > > Sent: Thursday, June 17, 2021 12:42 PM
> > > To: Catalin Marinas <catalin.marinas at arm.com>
> > > > Cc: Zhi Li <lznuaa at gmail.com>; Frank Li <frank.li at nxp.com>; Shenwei Wang
> > > > <shenwei.wang at nxp.com>; Han Xu <han.xu at nxp.com>; Nitin Garg
> > > > <nitin.garg at nxp.com>; Jason Liu <jason.hui.liu at nxp.com>;
> > > > linux-arm-kernel at lists.infradead.org
> > > > Subject: [EXT] Re: The problem about arm64: io: Relax implicit barriers in
> > > > default I/O accessors
> > >
> > > On Thu, Jun 17, 2021 at 06:25:28PM +0100, Will Deacon wrote:
> > > > On Thu, Jun 17, 2021 at 10:27:44AM +0100, Catalin Marinas wrote:
> > > > > On Wed, Jun 16, 2021 at 02:24:39PM -0500, Zhi Li wrote:
> > > > > > On Wed, Jun 16, 2021 at 2:18 PM Frank Li <frank.li at nxp.com> wrote:
> > > > > > > Will Deacon wrote:
> > > > > > > > > It would also be helpful to know a bit more about the hardware:
> > > > > > > >
> > > > > > > >   - What is the "internal bus fabric"?
> > > > > >
> > > > > > > > Look like ARM call as "Interconnect",  Multi AXI master and multi AXI slave
> > > > > > > > connected together.
> > > > > >
> > > > > > I  drawed simplified bus structure.
> > > > > >
> > > > > >         ┌──────┐ ┌────┐
> > > > > >         │ A53  │ │A72 │
> > > > > >         └───┬──┘ └─┬──┘
> > > > > >             │      │
> > > > > >         ┌───▼──────▼──┐
> > > > > >         │    CCI400   │
> > > > > >         └─────┬───────┘
> > > > > >               │   1 (a)write to ddr (normal uncached memory)
> > > > > >               │   DMB OSHST
> > > > > >               │   2 (b)write to usb register(device, nGnRE)
> > > > > >         ┌─────▼───────────────────────┐
>> > > ───────────┐
> > > > > >         │                             ◄───────┤   GPU
>> > > > > >         │     Bus fabric              │       │           │
> > > > > >         └────────────────────────────┬┘
>> > > ───────────┘
> > > > > > 3 (b) reach usb   ▲ 4 usb read   ▲   │ 6.(a)reach
> > > > > >          │        │   ddr        │   │
> > > > > >       ┌──▼────────┴─┐            │   │
> > > > > >       │             │            │   │
> > > > > >       │  USB        │      5.usb │   │
> > > > > >       │             │      read  │   │
> > > > > >       └─────────────┘            │   │
> > > > > >                                ┌─┴───▼─┐
> > > > > >                                │       │
> > > > > >                                │ DDR   │
> > > > > >                                │       │
> > > > > >                                └───────┘
> > > > >
> > > > > > Since you sent an HTML message, it was rejected by the list server. The
> > > > > > above is a plain-text rendition by w3m (and changed barrier() to DMB
> > > > > > OSHST).
> > > > > >
> > > > > > Is the DMB propagated to the bus fabric? IIUC, our logic is that if the
> > > > > > write (b) to USB is observable by, let's say, the GPU, the same GPU
> > > > > > should also observe the write (a) to DDR. Since the write (a) to DDR is
> > > > > > globally observable, the USB device read at (4) should also observe it
> > > > > > (well, we may be wrong).
> > > >
> > > > > It's pretty rare for barriers to propagate onto the fabric -- usually the
> > > > > CPU just orders everything based on acknowledgements. If the CCI gives the
> > > > > write response for the non-cacheable write I could see that causing an issue
> > > > > if the bus fabric can then reorder accesses, but then I would argue that's a
> > > > > broken system because simple ring buffers in non-cacheable memory would fail
> >
> > Bus fabric don't reorder the same axi master.
> >
> > https://elinux.org/images/7/73/Deacon-weak-to-weedy.pdf
> > Page 42 show race condition. I think above race condition happen at our system.
> > I am not sure if it is exist at Armv8 system.
> 
> Just a word of warning here, but the Armv8 memory model was
> *retrospectively* strengthened since I gave that talk, so the stuff in that
> pdf is out of date (and wrong).
> 
> > > > > for peripherals hooking into the bus fabric (i.e. dma_*mb() would be
> > > > > broken). I think it would also mean that DSB doesn't necessarily fix the
> > > > > issue, it probably just makes it less likely because it takes longer to
> > > > > get the device write out after the acknowledgement -- ndelay() would achieve
> > > > > the same effect :)
> >
> > That's what I worried.
> >
> > > >
> > > > > Frank -- what happens if you try either DMB SY, or DMB OSH (without the ST)
> > > > > in writel()?
> >
> > It works well for 2 hours! Normally, problem happen below 10min. So I think DMB SY
> > can fix it.
> 
> Oh, interesting. Maybe this is a case where OSH vs SY actually makes a
> difference. I'm not quite sure what it means for the coherency of normal,
> non-cacheable accesses (which are outer-shareable) so that probably needs a
> bit more thought.
> 
> Can you confirm that the issue *does* still occur if you use dmb(osh)
> instead of dmb(oshst), please?

After getting support from ARM (https://services.arm.com/support/s/case/5003t00001RuJHw),
we have made some progress on this issue.

Our system is configured with SYSBARDISABLE = 0x0, so barriers from the ARM cores
propagate to the CCI-400.

Our DMA and USB are located downstream of the CCI-400, so they are in the system
shareability domain. Only with dmb(st) does the CCI-400 wait for the previous
transactions to complete. With dmb(osh), the response is sent when snoop responses have
been received for all earlier transactions; the CCI-400 does not wait for the previous
write to finish.
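
As a rough illustration of steps 1 and 2 in the diagram (write to DDR, barrier,
write to the USB register), here is a minimal user-space sketch. It is not the
kernel's writel(): sketch_writel(), ring_doorbell() and the barrier_*() macros are
names made up for this example, and on non-arm64 hosts the macros fall back to the
compiler builtin __sync_synchronize() so that the file still builds. It only shows
where the barrier sits and which DMB variant is selected; it does not reproduce the
CCI-400 behaviour.

```c
#include <stdint.h>

/*
 * Hypothetical barrier macros for this sketch only (not the kernel's dmb()).
 * On arm64 they emit the DMB variants discussed in this thread.
 */
#if defined(__aarch64__)
#define barrier_oshst() __asm__ volatile("dmb oshst" ::: "memory") /* outer shareable, stores */
#define barrier_osh()   __asm__ volatile("dmb osh"   ::: "memory") /* outer shareable, all */
#define barrier_st()    __asm__ volatile("dmb st"    ::: "memory") /* full system, stores */
#define barrier_sy()    __asm__ volatile("dmb sy"    ::: "memory") /* full system, all */
#else
/* Fallback so the sketch compiles elsewhere: a full fence in every case. */
#define barrier_oshst() __sync_synchronize()
#define barrier_osh()   __sync_synchronize()
#define barrier_st()    __sync_synchronize()
#define barrier_sy()    __sync_synchronize()
#endif

/*
 * writel()-style store: barrier first, then the register write.
 * Assumption: the full-system store barrier is used here; swap in
 * barrier_oshst() to mirror the relaxed accessor discussed in the thread.
 */
static inline void sketch_writel(uint32_t val, volatile uint32_t *addr)
{
	barrier_st();
	*addr = val;
}

/* Steps 1 and 2 from the diagram, with plain pointers standing in for DDR and USB. */
static void ring_doorbell(volatile uint32_t *desc, volatile uint32_t *doorbell,
			  uint32_t payload)
{
	*desc = payload;            /* 1 (a) write to DDR (normal memory) */
	sketch_writel(1, doorbell); /* barrier, then 2 (b) write to USB register */
}
```

In a real driver desc would point into non-cacheable DMA memory and doorbell into
an ioremap()ed register; here both can be ordinary variables, which is enough to
check the store order in the source but says nothing about the fabric.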

Best regards
Frank Li

> 
> Will

