Unexpected issues with 2 NVME initiators using the same target

Chuck Lever chuck.lever at oracle.com
Mon Jul 10 12:03:18 PDT 2017


> On Jul 9, 2017, at 12:47 PM, Jason Gunthorpe <jgunthorpe at obsidianresearch.com> wrote:
> 
> On Sun, Jul 02, 2017 at 02:17:52PM -0400, Chuck Lever wrote:
> 
>> I could kmalloc the SGE array instead, signal each Send,
>> and then in the Send completion handler, unmap the SGEs
>> and then kfree the SGE array. That's a lot of overhead.
> 
> Usually after allocating the send queue you'd pre-allocate all the
> tracking memory needed for each SQE - eg enough information to do the
> dma unmaps/etc?

Right. In xprtrdma, the QP resources are allocated in rpcrdma_ep_create.
For every RPC-over-RDMA credit, rpcrdma_buffer_create allocates an
rpcrdma_req structure, which contains an ib_cqe and an array of SGEs for
the Send, and a number of other resources used to maintain registration
state during an RPC-over-RDMA call. Both of these functions are invoked
during transport instance set-up.

The problem is the lifetime for the rpcrdma_req structure. Currently it
is acquired when an RPC is started, and it is released when the RPC
terminates.

Inline send buffers are never unmapped until transport tear-down, but
since:

commit 655fec6987be05964e70c2e2efcbb253710e282f
Author:     Chuck Lever <chuck.lever at oracle.com>
AuthorDate: Thu Sep 15 10:57:24 2016 -0400
Commit:     Anna Schumaker <Anna.Schumaker at Netapp.com>
CommitDate: Mon Sep 19 13:08:38 2016 -0400

    xprtrdma: Use gathered Send for large inline messages

Part of the Send payload can come from page cache pages for NFS WRITE
and NFS SYMLINK operations. Send buffers that are page cache pages are
DMA unmapped when rpcrdma_req is released.

IIUC what Sagi found is that Send WRs can continue running even after an
RPC completes in certain pathological cases. Therefore the Send WR can
complete after the rpcrdma_req is released and page cache-related Send
buffers have been unmapped.

It's not an issue to make the RPC reply handler wait for Send completion.
In most cases this adds no latency, because the Send completes long before
the RPC reply arrives. That is by far the common case, though it does cost
an extra completion interrupt for nothing.

The problem arises if the RPC is terminated locally before the reply
arrives. Suppose, for example, the user hits ^C, or a timer fires. Then the
rpcrdma_req can be released and re-used before the Send completes.
There's no way to make RPC completion wait for Send completion.

One option is to somehow split the Send-related data structures from
rpcrdma_req, and manage them independently. I've already done that for
MRs: MR state is now located in rpcrdma_mw.

If instead I just never DMA map page cache pages, then all Send buffers
are always left DMA mapped while the transport is active. There's no
problem there with Send retransmits. The overhead is that I have to
either copy data into the Send buffers, or force the server to use RDMA
Read, which has a palpable overhead.


>> Or I could revert all the "map page cache pages" logic and
>> just use memcpy for small NFS WRITEs, and RDMA the rest of
>> the time. That keeps everything simple, but means large
>> inline thresholds can't use send-in-place.
> 
> Don't you have the same problem with RDMA WRITE?

The server side initiates RDMA Writes. The final RDMA Write in a WR
chain is signaled, but a subsequent Send completion is used to
determine when the server may release resources used for the Writes.
We're already doing it the slow way there, and there's no ^C hazard
on the server.


--
Chuck Lever

More information about the Linux-nvme mailing list