[PATCH rfc] nvme-pci: make sure to flush sqe writes before db record update

Jason Gunthorpe jgg at ziepe.ca
Thu Mar 8 10:36:23 PST 2018


On Thu, Mar 08, 2018 at 07:56:37PM +0200, Sagi Grimberg wrote:
> 
> >>>@@ -437,8 +437,14 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
> >>>  	if (++tail == nvmeq->q_depth)
> >>>  		tail = 0;
> >>>  	if (nvme_dbbuf_update_and_check_event(tail, nvmeq->dbbuf_sq_db,
> >>>-					      nvmeq->dbbuf_sq_ei))
> >>>+					      nvmeq->dbbuf_sq_ei)) {
> >>>+		/*
> >>>+		 * Make sure that descriptors are written before
> >>>+		 * doorbell record.
> >>>+		 */
> >>>+		wmb();
> >>>  		writel(tail, nvmeq->q_db);
> >>>+	}
> >>>  	nvmeq->sq_tail = tail;
> >>>  }
> >>
> >>If this really is necessary, we'd need this before updating the event
> >>shadow registers too.
> >>
> >>I'd like to understand this a bit more as we haven't done this in eight
> >>years and I can't recall any issues around this section. Have we just
> >>been fortunate that the problem this fixes is extraordinarily unlikely,
> >>or is something else implicitly ordering within this critical section?
> >
> >Well, there is a wmb() already inside
> >nvme_dbbuf_update_and_check_event so any failure would only
> >be related to dbbuf_sq_db being wrong when tail is written to q_db.
> 
> Right, we have that covered.
> 
> >Guessing that might be a basically undetectable situation, maybe some
> >temporary higher latency or something?
> 
> If the SQE and DB update have been reordered,

But that can't happen, the SQE is written before
nvme_dbbuf_update_and_check_event(), and that function does wmb.

The only reordering is related to this:

		wmb();

		old_value = *dbbuf_db;
		*dbbuf_db = value;
[..]
	writel(tail, nvmeq->q_db);

So q_db and dbbuf_db could be swapped at the worst.

Jason



More information about the Linux-nvme mailing list