[PATCHv1] nvmet-rdma: Fix missing dma sync to nvme data structures
Sagi Grimberg
sagi at grimberg.me
Tue Jan 17 00:07:39 PST 2017
>>> + for (i = 0; i < 2; i++) {
>>> + if (cmd->sge[i].length)
>>> + ib_dma_sync_single_for_device(ndev->device,
>>> + cmd->sge[0].addr, cmd->sge[0].length,
>>> + DMA_FROM_DEVICE);
>>> + }
>>
>> a. you test on sge[i] but sync sge[0].
> Crap code. I will fix this.
>
>> b. I don't think we need the for statement, let's keep it open-coded for [0]
>> and [1].
>
> I put a for loop because there was high-level agreement in the max_sge thread between Chuck, Steve and Jason about having a generic sg_list/bounce buffer and doing things similar to the RW APIs.
> If we generalize at that point, my thought was that this code could eventually move out of every ULP into that generic send() API.
> So I put a for loop to make it more ULP agnostic from the beginning.
Let's start simple and clear; if we indeed do this (and I should
check out this discussion) we'll move it out like all the rest of
the ULPs.
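Just to make the point concrete, the open-coded version would be
something like this (untested sketch only, reusing the cmd/ndev naming
from the hunk above; note that the index used in the length test and in
the sync call now match):
--
	if (cmd->sge[0].length)
		ib_dma_sync_single_for_device(ndev->device,
				cmd->sge[0].addr, cmd->sge[0].length,
				DMA_FROM_DEVICE);
	if (cmd->sge[1].length)
		ib_dma_sync_single_for_device(ndev->device,
				cmd->sge[1].addr, cmd->sge[1].length,
				DMA_FROM_DEVICE);
--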
>>> if (ndev->srq)
>>> return ib_post_srq_recv(ndev->srq, &cmd->wr, &bad_wr);
>>> @@ -507,6 +515,10 @@ static void nvmet_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
>>> struct nvmet_rdma_rsp *rsp =
>>> container_of(wc->wr_cqe, struct nvmet_rdma_rsp, send_cqe);
>>>
>>> + ib_dma_sync_single_for_cpu(rsp->queue->dev->device,
>>> + rsp->send_sge.addr, rsp->send_sge.length,
>>> + DMA_TO_DEVICE);
>>
>> Why do you need to sync_for_cpu here? You have no interest in the data at
>> this point.
>>
> Before a CQE can be prepared by the CPU, the buffer needs to be synced.
> So once the CQE send has completed, that region is ready for preparing a new CQE.
> In the error case the CQE is prepared by the RDMA layer and sent using nvmet_req_complete().
> In the happy-path case the CQE is prepared by the core layer before invoking the queue_response() callback of fabric_ops.
>
> In the happy case nvmet core needs to do the sync_for_cpu.
> In the error case the RDMA layer needs to do the sync_for_cpu.
>
> Instead of messing with the code in both places, I did the sync_for_cpu in send_done(), which is a unified place.
> If there were a generic callback in fabric_ops that could be invoked by __nvmet_req_complete(), it would be cleaner to do this in a single place by invoking it.
> I didn't find it worthwhile to grow fabric_ops for this.
> Let me know if you have a different idea to resolve this.
Why not sync it when you start using the cmd at nvmet_rdma_recv_done()?
--
@@ -747,6 +747,10 @@ static void nvmet_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 	rsp->flags = 0;
 	rsp->req.cmd = cmd->nvme_cmd;
+	ib_dma_sync_single_for_cpu(rsp->queue->dev->device,
+			rsp->send_sge.addr, rsp->send_sge.length,
+			DMA_TO_DEVICE);
+
 	if (unlikely(queue->state != NVMET_RDMA_Q_LIVE)) {
 		unsigned long flags;
--
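The counterpart on the send side would then be a sync back to the
device once the CQE has been written, before the send WR is posted in
nvmet_rdma_queue_response(), along these lines (again just a sketch
with the same rsp naming, not a final patch):
--
	/* The CPU is done writing the CQE at this point; hand the
	 * response buffer back to the device before posting the send.
	 */
	ib_dma_sync_single_for_device(rsp->queue->dev->device,
			rsp->send_sge.addr, rsp->send_sge.length,
			DMA_TO_DEVICE);
--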
>
>>> nvmet_rdma_rsp *rsp)
>>> static void nvmet_rdma_handle_command(struct nvmet_rdma_queue *queue,
>>> 		struct nvmet_rdma_rsp *cmd)
>>> {
>>> + int i;
>>> u16 status;
>>>
>>> cmd->queue = queue;
>>> cmd->n_rdma = 0;
>>> cmd->req.port = queue->port;
>>>
>>> + for (i = 0; i < 2; i++) {
>>> + if (cmd->cmd->sge[i].length)
>>> + ib_dma_sync_single_for_cpu(queue->dev->device,
>>> + cmd->cmd->sge[i].addr, cmd->cmd->sge[i].length,
>>> + DMA_FROM_DEVICE);
>>> + }
>>
>> Again, we don't need the for statement.
>>
>> Also, I think we can optimize a bit by syncing the in-capsule page only if:
>> 1. it was posted for recv (sge has length)
>> 2. it's a write command
>> 3. it has in-capsule data
>>
>> So, here let's sync the sqe (sge[0]) and sync the in-capsule page in
>> nvmet_rdma_map_sgl_inline().
>
> I agree, Sagi, that this can be deferred to _inline(). I was headed toward doing this in generic code eventually, similar to the RW API.
> And I thought, why not have the clean code now so that migrating to the new API later would be easy.
> But if you think we should defer it to a later stage, I am fine with that and will continue with open coding.
>
> Now that I review the code of map_sgl_inline(), I am wondering whether we should sync the INLINE data page at all, because the CPU is not going to read it anyway.
> It's only the remote device which will read/write it, and it will do the dma sync to device as part of that driver anyway.
Yea, you're right. Let's kill it.
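So the recv side would boil down to a single sync of the SQE when we
start handling the command, with the in-capsule page left alone, e.g.
in nvmet_rdma_handle_command() (sketch only, same cmd/queue naming as
the quoted hunk):
--
	/* Per the discussion above, only the SQE needs to be visible
	 * to the CPU here; the in-capsule data page is not read by
	 * the CPU, so it is not synced.
	 */
	ib_dma_sync_single_for_cpu(queue->dev->device,
			cmd->cmd->sge[0].addr, cmd->cmd->sge[0].length,
			DMA_FROM_DEVICE);
--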