Unexpected issues with 2 NVME initiators using the same target
Max Gurtovoy
maxg at mellanox.com
Tue Jun 20 06:28:11 PDT 2017
On 6/20/2017 3:02 PM, Sagi Grimberg wrote:
>
>>>> Can you share the check that correlates to the vendor+hw syndrome?
>>>
>>> mkey.free == 1
>>
>> Hmm, the way I understand it is that the HW is trying to access
>> (locally via send) an MR which was already invalidated.
>>
>> Thinking about this further, this can happen in a case where the target
>> already completed the transaction and sent SEND_WITH_INVALIDATE, but the
>> original send ack was lost somewhere, causing the device to retransmit
>> from the MR (which was already invalidated). This is highly unlikely,
>> though.
>>
>> Shouldn't this be protected somehow by the device?
>> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>>
>> Say the host registers MR (a) and sends (1) from that MR to a target,
>> the ack for send (1) gets lost, and the target issues SEND_WITH_INVALIDATE
>> on MR (a) and the host HCA processes it; the host HCA then times out on
>> send (1) and retries it, but ehh, it's already invalidated.
>
> Well, this entire flow is broken. Why should the host send the MR rkey
> to the target if it is not using it for remote access? The target
> should never have a chance to remotely invalidate something it did not
> access.
>
> I think we have a bug in the iSER code, as we should not send the key
> for remote invalidation if we do an inline data send...
>
> Robert, can you try the following:
> --
> diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
> index 12ed62ce9ff7..2a07692007bd 100644
> --- a/drivers/infiniband/ulp/iser/iser_initiator.c
> +++ b/drivers/infiniband/ulp/iser/iser_initiator.c
> @@ -137,8 +137,10 @@ iser_prepare_write_cmd(struct iscsi_task *task,
>
>         if (unsol_sz < edtl) {
>                 hdr->flags     |= ISER_WSV;
> -               hdr->write_stag = cpu_to_be32(mem_reg->rkey);
> -               hdr->write_va = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
> +               if (buf_out->data_len > imm_sz) {
> +                       hdr->write_stag = cpu_to_be32(mem_reg->rkey);
> +                       hdr->write_va = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
> +               }
>
>                 iser_dbg("Cmd itt:%d, WRITE tags, RKEY:%#.4X "
>                          "VA:%#llX + unsol:%d\n",
> --
>
> Although, I still don't think it's enough. We need to delay the
> local invalidate until we receive a send completion (which guarantees
> that the ack was received)...
>
> If this is indeed the case, _all_ ULP initiator drivers share it, because
> we never condition on a send completion in order to complete an I/O, and
> in the case of a lost ack on a send, it looks like we need to... It *will*
> hurt performance.
>
> What do other folks think?
Sagi,
As I mentioned in the earlier mail (with my "patch"), your scenario
may happen and actually did happen.
I agree we need to fix all the initiator ULP drivers.
>
> CC'ing Bart, Chuck, Christoph.
>
> Guys, to summarize, I think we might have broken behavior in the
> initiator mode drivers. We never condition on send completions (for
> requests) before we complete an I/O. The issue is that the ack for those
> sends might get lost, which means that the HCA will retry them (they get
> dropped by the peer HCA), but if we happen to complete the I/O before
> that, we can either unmap the request area or, for inline data,
> invalidate it (so the HCA will try to access an MR which was already
> invalidated).
>
> Signalling all send completions and also finishing I/Os only after we
> get them will add latency, and that sucks...
Yes, you're right. We must complete the I/O only after receiving both
the send and recv completions.
I can run some performance checks with iSER/NVMf, signalling every send
and releasing the I/O task only after both completions have arrived.
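To make that concrete, here is a minimal sketch of the approach, assuming
the new-style CQ API (ib_cqe) -- this is not the existing iSER/NVMf code,
and the my_io_req/my_queue_request()/my_complete_io() names are made up:
--
/*
 * Hedged sketch only: take two references per I/O, drop one from the
 * (signalled) send completion and one from the response path, and only
 * unmap/invalidate and complete the I/O once both have arrived.
 */
#include <linux/atomic.h>
#include <rdma/ib_verbs.h>

struct my_io_req {
	struct ib_cqe	send_cqe;	/* completion context for the request SEND */
	atomic_t	ref;		/* 2 == waiting for send comp + response */
	/* ... MRs, sg lists, upper layer request, etc. ... */
};

static void my_complete_io(struct my_io_req *req);	/* unmap/invalidate + complete */

static void my_io_put(struct my_io_req *req)
{
	if (atomic_dec_and_test(&req->ref))
		my_complete_io(req);
}

static void my_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct my_io_req *req =
		container_of(wc->wr_cqe, struct my_io_req, send_cqe);

	/* send completion seen: the peer acked our SEND, drop one reference */
	my_io_put(req);
}

static int my_queue_request(struct ib_qp *qp, struct my_io_req *req,
			    struct ib_sge *sge, int num_sge)
{
	struct ib_send_wr wr = {}, *bad_wr;

	atomic_set(&req->ref, 2);	/* send completion + response */
	req->send_cqe.done = my_send_done;

	wr.wr_cqe     = &req->send_cqe;
	wr.sg_list    = sge;
	wr.num_sge    = num_sge;
	wr.opcode     = IB_WR_SEND;
	wr.send_flags = IB_SEND_SIGNALED;	/* signal every request send */

	return ib_post_send(qp, &wr, &bad_wr);
}

/* the response (recv completion) path would likewise call my_io_put(req) */
--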