[PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

Jason Gunthorpe jgg at ziepe.ca
Sat Mar 17 08:05:20 PDT 2018


On Sat, Mar 17, 2018 at 12:25:14AM -0400, Sinan Kaya wrote:
> On 3/17/2018 12:03 AM, Sinan Kaya wrote:
> > On 3/16/2018 11:40 PM, Sinan Kaya wrote:
> >> I'll change writel_relaxed() with __raw_writel() in the series like you suggested
> >> and also look at your other comments.
> > 
> > I spoke too soon.
> > 
> > Now that I think about it, the code needs to follow one of these patterns for correctness:
> > 
> > 1)
> > wmb()
> > writel()/writel_relaxed()
> > 
> > or 
> > 
> > 2)
> > wmb()
> > __raw_writel()
> > mmiowb()
> > 
> > but definitely not
> > 
> > wmb()
> > __raw_writel()
> > 
> > Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
> > 
> > Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb()
> > for correctness. ARM's mmiowb() implementation is empty.
> > 
> > So, there is no one size fits all solution with the current state of affairs.
> > 
> > 
> 
> I think I finally got what you mean.
> 
> Code seems to have
> 
> wmb()
> writel()/writeq()
> wmb()
> 
> this can be safely replaced with
> 
> wmb()
> __raw_writel()/__raw_writeq()
> wmb()
> 
> This will work on all arches. Below is the new version. Let me know if this is OK.
> 
> +++ b/drivers/infiniband/hw/cxgb4/t4.h
> @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
>         int count = 8;
> 
>         while (count) {
> -               writeq(*src, dst);
> +               __raw_writeq(*src, dst);
>                 src++;
>                 dst++;
>                 count--;
> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
>                                  (u64 *)wqe);
>                 } else {
>                         pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
> -                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> -                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> +                                    wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>                 }
> 
>                 /* Flush user doorbell area writes. */
>                 wmb();
>                 return;
>         }
> -       writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> +       __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> +       mmiowb();
>  }
> 
>  static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
> @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
>                                  (void *)wqe);
>                 } else {
>                         pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
> -                       writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> -                              wq->rq.bar2_va + SGE_UDB_KDOORBELL);
> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> +                                    wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>                 }
> 
>                 /* Flush user doorbell area writes. */
>                 wmb();
>                 return;
>         }
> -       writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> +       __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> +       mmiowb();

No! NAK! Adding *new* barriers to use the *wrong* accessor is crazy!

Your first patch was right: replacing

  wmb()
  writel()

with

  wmb()
  writel_relaxed()

is fine, and it helps ARM.

The problem with PPC has nothing to do with the writel, it is with the
spinlock, and we can't improve it without adding some new
infrastructure. Certainly don't hack around the problem by using
__raw_writel in multi-arch drivers!
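
Roughly, for the plain doorbell write in your diff it would look
something like this (untested sketch, just reusing the names from
t4_ring_sq_db above):

        /* Order the host memory (WQE) writes before the doorbell. */
        wmb();

        /*
         * The relaxed accessor is enough here because the wmb() above
         * already provides the ordering; on ARM this drops the extra
         * barrier hidden inside writel().
         */
        writel_relaxed(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);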

In userspace we settled on something like this as a pattern:

 mmio_wc_spin_lock()
 writel_wc_mmio()
 mmio_wc_spin_unlock()

Where mmio_wc_spin_unlock incorporates the mmiowb() and is defined to
fully contain all MMIO accesses to WC and UC memory.

Because of our userspace, the pattern is specifically used for MMIO
writes to WC BAR memory, but it could be generalized.

This allows all known arches to use the best code in all cases. x86
can apparently even optimize a lot and fold the mmiowb() into the
spin_unlock.
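
A very rough sketch of what such helpers could look like on the kernel
side (the helper names, the lock, and the t4_wq field below are made
up, only to show where the mmiowb() would live):

        #define mmio_wc_spin_lock(lock, flags) \
                spin_lock_irqsave(lock, flags)

        #define mmio_wc_spin_unlock(lock, flags)                        \
                do {                                                    \
                        /* keep all MMIO inside the lock on e.g. PPC */ \
                        mmiowb();                                       \
                        spin_unlock_irqrestore(lock, flags);            \
                } while (0)

A driver would then just do:

        unsigned long flags;

        mmio_wc_spin_lock(&wq->lock, flags);    /* wq->lock is hypothetical */
        writel_relaxed(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
        mmio_wc_spin_unlock(&wq->lock, flags);

and never have to remember the mmiowb() itself.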

Jason


