Unexpected issues with 2 NVME initiators using the same target
Sagi Grimberg
sagi at grimberg.me
Tue Jun 20 05:02:17 PDT 2017
>>> Can you share the check that correlates to the vendor+hw syndrome?
>>
>> mkey.free == 1
>
> Hmm, the way I understand it is that the HW is trying to access
> (locally via send) a MR which was already invalidated.
>
> Thinking of this further, this can happen in a case where the target
> already completed the transaction, sent SEND_WITH_INVALIDATE but the
> original send ack was lost somewhere causing the device to retransmit
> from the MR (which was already invalidated). This is highly unlikely
> though.
>
> Shouldn't this be protected somehow by the device?
> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>
> Say the host registers MR (a) and issues send (1) from that MR to a
> target, the ack for send (1) gets lost, and the target issues
> SEND_WITH_INVALIDATE on MR (a), which the host HCA processes; then the
> host HCA times out on send (1) and retries, but ehh, it's already
> invalidated.
Well, this entire flow is broken. Why should the host send the MR rkey
to the target if the target is not using it for remote access? The
target should never have a chance to remotely invalidate something it
did not access.
I think we have a bug in the iSER code: we should not send the key
for remote invalidation if we send the data inline...
Robert, can you try the following:
--
diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
index 12ed62ce9ff7..2a07692007bd 100644
--- a/drivers/infiniband/ulp/iser/iser_initiator.c
+++ b/drivers/infiniband/ulp/iser/iser_initiator.c
@@ -137,8 +137,10 @@ iser_prepare_write_cmd(struct iscsi_task *task,
 
 	if (unsol_sz < edtl) {
 		hdr->flags |= ISER_WSV;
-		hdr->write_stag = cpu_to_be32(mem_reg->rkey);
-		hdr->write_va = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
+		if (buf_out->data_len > imm_sz) {
+			hdr->write_stag = cpu_to_be32(mem_reg->rkey);
+			hdr->write_va = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
+		}
 
 		iser_dbg("Cmd itt:%d, WRITE tags, RKEY:%#.4X "
 			 "VA:%#llX + unsol:%d\n",
--
Although, I still don't think it's enough. We need to delay the
local invalidate until we receive a send completion (which guarantees
that the ack was received)...
If this is indeed the case, _all_ ULP initiator drivers share it,
because we never condition on a send completion in order to complete an
I/O, and in the case of a lost ack on a send, it looks like we need
to... It *will* hurt performance.
What do other folks think?
CC'ing Bart, Chuck, Christoph.
Guys, for summary, I think we might have broken behavior in the
initiator mode drivers. We never wait for the send completion of a
request before we complete an I/O. The issue is that the ack for such a
send might get lost, which means that the HCA will retry it (and the
peer HCA will drop the duplicate), but if we happen to complete the I/O
before that, we may have already unmapped the request area, or, for
inline data, invalidated its MR (so the HCA will try to access an MR
which was already invalidated).
Signalling all send completions and finishing I/Os only after we
got them will add latency, and that sucks...
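To make the idea concrete, here is a minimal sketch of what "condition
I/O completion on the send completion" could look like. The names
(`io_request`, `io_submit`, `io_event_done`) are hypothetical and not
from the iSER or NVMe code; the point is only the counting scheme: each
request waits for two events, the local send completion and the
target's response, and only the last one to arrive makes it safe to
unmap or locally invalidate the MR.

```c
/*
 * Hypothetical sketch: gate I/O completion on two events, the send
 * completion (proves the ack for the send was received, so the HCA
 * will not retransmit from the MR) and the target's response.
 * Only once both have fired is it safe to unmap/invalidate the MR.
 */
#include <stdatomic.h>

struct io_request {
	atomic_int pending;		/* events still outstanding */
	int safe_to_invalidate;		/* set once both events arrived */
};

static void io_submit(struct io_request *req)
{
	req->safe_to_invalidate = 0;
	atomic_store(&req->pending, 2);	/* send completion + response */
	/* ...post the signalled send here... */
}

/* Call from both the send-completion handler and the response handler. */
static void io_event_done(struct io_request *req)
{
	if (atomic_fetch_sub(&req->pending, 1) == 1)
		req->safe_to_invalidate = 1;	/* last event: unmap/invalidate now */
}
```

The latency cost mentioned above falls out of this scheme directly:
every send must now be signalled, so each request takes one extra
completion before it can finish.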