Unexpected issues with 2 NVME initiators using the same target
Sagi Grimberg
sagi at grimberg.me
Tue Jun 20 02:33:14 PDT 2017
>>> Here the parsed output, it says that it was access to mkey which is
>>> free.
Missed that :)
>>> ======== cqe_with_error ========
>>> wqe_id : 0x0
>>> srqn_usr_index : 0x0
>>> byte_cnt : 0x0
>>> hw_error_syndrome : 0x93
>>> hw_syndrome_type : 0x0
>>> vendor_error_syndrome : 0x52
>>
>> Can you share the check that correlates to the vendor+hw syndrome?
>
> mkey.free == 1
Hmm, the way I understand it is that the HW is trying to access
(locally via send) an MR which was already invalidated.
Thinking of this further, this can happen in a case where the target
already completed the transaction, sent SEND_WITH_INVALIDATE but the
original send ack was lost somewhere causing the device to retransmit
from the MR (which was already invalidated). This is highly unlikely
though.
Shouldn't this be protected somehow by the device?
Can someone explain why the above cannot happen? Jason? Liran? Anyone?
Say the host registers MR (a) and posts send (1) from that MR to a
target. The ack for send (1) gets lost, the target issues
SEND_WITH_INVALIDATE on MR (a), and the host HCA processes it. Then the
host HCA times out on send (1) and retries, but ehh, it's already
invalidated.
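
To make the ordering in the lost-ack scenario concrete, here is a toy
replay of the event sequence. It is only a sketch of the hypothesis
above, not a model of real HCA behavior: the class and function names
(HostHca, MemoryRegion, replay_lost_ack_race) are made up for
illustration, and the "valid" flag merely stands in for the
mkey.free == 1 check mentioned earlier.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryRegion:
    key: int
    valid: bool = True  # False stands in for "mkey.free == 1" (assumption)


@dataclass
class HostHca:
    """Hypothetical, heavily simplified host HCA state."""
    mrs: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def register_mr(self, key):
        self.mrs[key] = MemoryRegion(key)

    def transmit(self, send_id, key):
        """Transmit or retransmit a send that gathers its payload from MR `key`."""
        if not self.mrs[key].valid:
            self.log.append(f"send {send_id}: local access to invalidated MR {key:#x}")
            return False
        self.log.append(f"send {send_id}: transmitted from MR {key:#x}")
        return True

    def recv_send_with_invalidate(self, key):
        """Target's SEND_WITH_INVALIDATE arrives and the MR is invalidated."""
        self.mrs[key].valid = False
        self.log.append(f"MR {key:#x} invalidated by remote")


def replay_lost_ack_race():
    hca = HostHca()
    hca.register_mr(0xA)                 # host registers MR (a)
    first = hca.transmit(1, 0xA)         # send (1) goes out, its ack is lost
    hca.recv_send_with_invalidate(0xA)   # target finished, invalidates MR (a)
    retry = hca.transmit(1, 0xA)         # ack timeout, HCA retransmits send (1)
    return first, retry, hca.log


if __name__ == "__main__":
    first, retry, log = replay_lost_ack_race()
    for entry in log:
        print(entry)
```

In this toy ordering the first transmit succeeds and the retransmit
fails, because nothing ties the invalidation to the ack of send (1).
Whether a real device allows this ordering is exactly the open question.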
Or, we can also have a race where we destroy all our MRs when I/O
is still running (but from the code we should be safe here).
Robert, when you rebooted the target, I assume the iscsi ping
timeout expired and the connection teardown started, correct?