[PATCH v11 00/25] nvme-tcp receive offloads

Shai Malin smalin at nvidia.com
Wed Mar 22 03:19:25 PDT 2023


Hi Sagi
 
On Tue, 7 Mar 2023 at 12:11, Sagi Grimberg sagi at grimberg.me wrote:
> >> Have you considered using a more standard interfaces to use this
> >> such that spdk or an io_uring based initiator can use it?
> >
> > The main problem which I will explain in more detail (under the digest,
> > teardown and resync flows) is that in order to use a more standard interface
> > that will hide what the HW needs it will require duplicating the NVMeTCP
> > logic of tracking the PDUs in the TCP stream – this seems wrong to us.
> 
> Hmm. Its not really the entire logic, its only the pdu/data
> boundaries. Every data coming to the host is structured, with a standard
> C2HData PDU.
 
In order to track the PDUs we would need to duplicate the parsing and tracking
of the common headers for all PDU types. That is doable, but it would also need
to be trustworthy, which means duplicating the header digest verification when
it is enabled.
 
> 
> If we assume for the sake of the discussion that such a parser exist,
> how would the interface to the ulp look like?
 
Enablement could be done via setsockopt(). We would add a new
SOL_NVME_TCP level and store its state in new socket struct fields.
We would also set the resync callbacks from ipvX/tcp.c (explained further
down).
 
Per IO setup:
-------------
The ddp_setup() call would be moved to ipv4/ipv6 implementations of TCP, 
in sock_recvmsg().
As you suggested, we would add a new MSG_NVME_DDP flag to register an 
expected NVMeTCP data response PDU.             
Adding a command_id field to the msghdr struct would break
userspace ABI. Instead we would use msghdr ancillary fields and add a new
cmsg_level (SOL_NVME_TCP) and a new cmsg_type (NVME_TCP_COMMAND_ID).
	
      struct msghdr msg = { .msg_flags = MSG_NVME_DDP };
      u32 command_id = 0x1;
      struct cmsghdr *cmsg;
      char buf[CMSG_SPACE(sizeof(command_id))];
 
      msg.msg_control = buf;
      msg.msg_controllen = sizeof(buf);
      cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_NVME_TCP;
      cmsg->cmsg_type = NVME_TCP_COMMAND_ID;
      cmsg->cmsg_len = CMSG_LEN(sizeof(command_id));
      /* CMSG_DATA() is not guaranteed to be aligned for a u32 store */
      memcpy(CMSG_DATA(cmsg), &command_id, sizeof(command_id));
      msg.msg_controllen = cmsg->cmsg_len;
 
      iov_iter_bvec(&msg.msg_iter, ITER_DEST, vec, nr_bvec, size);
      sock_recvmsg(queue->sock, &msg, msg.msg_flags);
 
Per IO teardown:
----------------
The ddp_teardown will need to remain a separate API. As explained before,
it is needed in order to verify that the HW teardown flow has completed
before posting the NVMe completion.
 
It’s not possible to move the teardown into sock_recvmsg(), as it adds a
significant delay (~1 usec) and would prevent us from processing other PDUs
in the meantime.
 
The challenge here is that we haven’t found a good, inexpensive mechanism in
the socket API for such per-IO notifications (I assume we don’t want to use
signals).
This might block the socket API approach.
 
Data Digest offload:
--------------------
Since recvmsg() would only read a single PDU, your suggested 
approach could work:

          ddp_ddgst = CMSG_DATA(CMSG_FIRSTHDR(msg))
          if (ddp_ddgst->valid) {
                 great...
          } else {
                 recalc_ddgst
          }
                  
Receive flow:
-------------
Using recvmsg() doesn't provide the callback mechanism that sock->read_sock()
does, so the logic in the NVMeTCP driver code would have to change.
Instead of parsing and copying the PDU in chunks as it arrives via the callback,
the driver would have to process the entire PDU in one go upon returning from
recvmsg().
 
To handle re-ordered NVMe-TCP responses, the ULP would duplicate the
tracking of PDUs and maintain a list of completed responses not yet
received by the NVMeTCP driver.
The NVMeTCP driver would call recvmsg(s, &ddp_buf, cccid) only for C2H PDUs,
based on the PDU header type and command_id.
 
If placement was successful, the data is already in ddp_buf. If not, the kernel
would perform the regular software copy. It would be transparent for the 
NVMeTCP driver.
 
As mentioned before, in order to safely track the PDUs, we will
need to duplicate the header digest verification if enabled.
 
In summary, the flow for the NVMeTCP driver could be something like:
 
// Enable nvme-tcp offload
                setsockopt(s, SOL_NVME_TCP, ...);
 
...
 
// Command sending:
 
                // register memory
                msg.flags = MSG_NVME_DDP;
                msg.iov_iter = resp_buf;
 
                // set 1 msg ancillary field with the command id
                cmsg->cmsg_level = SOL_NVME_TCP;
                cmsg->cmsg_type = NVME_TCP_COMMAND_ID;
                cmsg->cmsg_len = CMSG_LEN(sizeof(u32));
                memcpy(CMSG_DATA(cmsg), &command_id, sizeof(u32)); // command id
                recvmsg(s, msg, msg.flags);
 
                // send the request
                send(s, req_buf, ...);
 
...
 
// C2H receive:
 
We will need to modify nvme_tcp_recv_pdu() to read only the next 
PDU header and, according to the PDU type and command_id, perform 
the following:
      
      ddp_buf = get_cid_buffer(cid);
      msg.iov_iter = ddp_buf;
      // set 1 msg ancillary field with the command id
      cmsg->cmsg_level = SOL_NVME_TCP;
      cmsg->cmsg_type = NVME_TCP_COMMAND_ID;
      cmsg->cmsg_len = CMSG_LEN(sizeof(u32));
      memcpy(CMSG_DATA(cmsg), &cid, sizeof(u32));
      recvmsg(s, msg, msg.flags);
      
      // Teardown (for this we don't have a good solution)
 
      // Data digest check
      ddp_ddgst = CMSG_DATA(CMSG_FIRSTHDR(msg))
      if (ddp_ddgst->valid) {
            great...
      } else {
            recalc_ddgst
      }    

Resync:
-------
This API could be removed from the NVMeTCP driver, and the processing of
resync requests could move into the ipv4/ipv6 TCP code.
The resync pending state would move to the socket struct.
But in this case as well, in order to safely track the PDUs, we would need to
parse all PDUs and also verify the header digest when enabled.
 
User space:
-----------
Regarding the user space design: since issuing 3 syscalls per IO (recvmsg + 
sendmsg + recvmsg) would be too expensive, we could add io_uring support 
to wrap the above APIs.
For example, similarly to recvmsg(), we would submit an IORING_OP_RECVMSG 
sqe before each IO (with the MSG_NVME_DDP flag) to register memory for DDP.
 
> 
> > I can add that we are also working on an end-to-end user-space design for SPDK
> > and we don’t see how the two designs could use the same socket APIs without
> > impacting the performance gain of both cases.
> 
> Can you share a bit more how this will be done for spdk? spdk runs on
> normal sockets today, hence I'd love to know how you plan to introduce
> it there.
 
The SPDK developed at Nvidia can use XLIO
(https://docs.nvidia.com/networking/display/XLIOv214) as its TCP stack.
It is out of the scope of this discussion, as it does not follow the
normal socket operations.
 
> 
> >
> >>
> >> To me it appears that:
> >> - ddp limits can be obtained via getsockopt
> >> - sk_add/sk_del can be done via setsockopt
> >> - offloaded DDGST crc can be obtained via something like
> >>     msghdr.msg_control
> >
> > If we wish to hide it from the NVMeTCP driver it will require us to duplicate
> > the NVMeTCP logic of tracking the PDUs in the TCP stream.
> > In the positive case, when there are no DDGST errors, nothing is needed to be
> > done in the NVMeTCP driver, but in the case of errors (or just in the case of
> > out of order packet in the middle of the PDU), the DDGST will need to be
> > calculated in SW and it seems wrong to us to duplicate it outside of the
> > NVMeTCP driver.
> 
> I didn't suggest that the recalc would be hidden, if the calculation
> failed the ULP can do it.
> 
> Something like:
>         ddp_ddgst = CMSG_DATA(CMSG_FIRSTHDR(msg))
>         if (ddp_ddgst->valid) {
>                 great...
>         } else {
>                 recalc_ddgst
>         }
 
Yes, we agree, as explained before.
 
> 
> >
> >> - Perhaps for setting up the offload per IO, recvmsg would be the
> >>     vehicle with a new msg flag MSG_RCV_DDP or something, that would hide
> >>     all the details of what the HW needs (the command_id would be set
> >>     somewhere in the msghdr).
> >
> > Our design includes the following steps per IO:
> > - ddp_setup (register the command id and buffer to the HW)
> > - The existing flow which sends the command
> > - The existing flow of read_sock()
> > - ddp_teardown (for the per IO HW teardown, before posting the NVMe completion)
> >
> > Using the recvmsg will only replace the read_sock() but this part in the NVMeTCP
> > driver is not impacted by the offload design.
> > The ddp_setup is needed in order to set the command id and the buffer, and this
> > needs to be done before or as part of the sending of command and prior to the
> > receive flow.
> 
> I agree, I didn't suggest that replacing the entire read_sock flow
> (although that can change to recvmsg as well).
> 
> I was thinking more along the lines of MSG_PEEK that doesn't actually
> reads anything, where the driver prior to sending the command, would do:
> 
>         // do a DDP recvmsg to setup direct data placement
>         msg = { .msg_flags = MSG_DDP_SETUP };
>         iov_iter_bvec(&msg.msg_iter, ITER_DEST, vec, nr_bvec, size);
>         sock_recvmsg(queue->sock, &msg, msg.msg_flags);
> 
>         ...
> 
>         // now send the cmd to the controller
>         nvme_tcp_setup_cmd_pdu
>         nvme_tcp_try_send_cmd_pdu
>         ...
> 
> That is what I was thinking about.
 
This was answered as part of the “Receive flow” section above.
 
> 
> > In addition, without duplicating the tracking of the PDUs in the TCP stream,
> > it is not possible to hide the teardown flow from the NVMeTCP driver.
> 
> The teardown flow is a bummer because it delays the nvme completion.
> Is it ok to dma_unmap and then schedule a ddp teardown async without
> delaying the nvme completion?
 
With the existing HW it is not possible. The teardown is required in 
order to guarantee that the HW will not access its resources 
or the IO buffer after the NVMe completion has been posted.
 
In future HW, the teardown flow could improve.
 
> 
> How does TLS offload handles the async teardown?
 
In TLS we don’t have such a per-IO operation. Only the resync is similar.
 
> 
> >
> >> - And all of the resync flow would be something that a separate
> >>     ulp socket provider would take care of. Similar to how TLS presents
> >>     itself to a tcp application. So the application does not need to be
> >>     aware of it.
> >
> > The resync flow requires awareness of TCP sequence numbers and NVMe
> > PDUs boundaries. If we hide it from the NVMeTCP driver we would have
> > to again duplicate NVMe PDU tracking code.
> 
> Its really a pretty trivial part of the pdu tracking. Just when
> a C2HData PDU starts...
 
The resync is not only in the context of C2HData PDUs.
As explained before, it requires parsing the common header and validating
the header digest in order to verify the resync of the HW.
We really don’t like this duplication.
 
> 
> > TLS is a stream protocol and maps cleanly to TCP socket operations.
> > NVMeTCP on the other hand is a request-response protocol.
> > On the data path, it's not comparable.
> 
> Is it? looks similar to me in the sense that the input stream needs to
> know when a msg starts and ends. The only difference is that the ULP
> needs to timely provide the context to the offload.
 
It's a big difference: an app using TLS doesn’t need to handle the IO-path
operations that the NVMeTCP offload requires (ddp_setup, ddp_teardown).
 
> 
> > While it could be done by contorting recvmsg() with new flags, adding input
> > fields to the msghdr struct and changing the behaviour of uAPI, it will add
> > a lot more friction than a separate ULP DDP API specifically made for this
> > purpose.
> 
> This may be true. I was merely asking if this was a path that was
> genuinely explored and ruled out. To me, it sounds like it can
> work, and also allow userspace to automatically gain from it.
 
We have explored different approaches but ultimately picked this one.
While it might be more intrusive, it offers the better compromise between 
performance and cleanliness (avoiding code duplication and major 
refactoring).
 
The socket API:
1. Won’t simplify the nvme-tcp code; it will arguably make it more complex.
That is assuming we resolve all the issues mentioned earlier.
 
2. Its only real benefit would be making the feature available from 
userspace, which we doubt can be as performant.
(Side note: based on our experience, managing the entire TCP stack from 
userspace achieves much better performance, but this is again out of the 
scope of this discussion.)
 
Based on that, our request would be to review and consider merging this 
series with our original design. 
 
> 
> >> I'm not sure that such interface could cover everything that is needed,
> >> but what I'm trying to convey, is that the current interface limits the
> >> usability for almost anything else. Please correct me if I'm wrong.
> >> Is this designed to also cater anything else outside of the kernel
> >> nvme-tcp host driver?
> >
> > As part of this series, we are targeting the kernel nvme-tcp host driver,
> > and later we are planning to add support for the kernel nvme-tcp target driver.
> >
> > The ULP layer was designed to be generic for other request-response
> based
> > protocols.
> 
> I understand.
> 
> >
> >>
> >>> Compatibility
> >>> =============
> >>> * The offload works with bare-metal or SRIOV.
> >>> * The HW can support up to 64K connections per device (assuming no
> >>>     other HW accelerations are used). In this series, we will introduce
> >>>     the support for up to 4k connections, and we have plans to increase it.
> >>> * SW TLS could not work together with the NVMeTCP offload as the HW
> >>>     will need to track the NVMeTCP headers in the TCP stream.
> >>
> >> Can't say I like that.
> >
> > The answer should be to support both TLS offload and NVMeTCP offload,
> > which is a HW limit in our current generation.
> 
> I can't complain about this because we don't have TLS supported in
> nvme-tcp. But _can_ they work together on future HW?
 
Yes. With our original design, the effort to support both offloads is incremental.
 
But with the new socket API design discussed here, stacking ULPs will be more 
challenging, and hence doing TLS + NVMe-TCP will be harder.
 
> 
> >
> >>
> >>> * The ConnectX HW support HW TLS, but in ConnectX-7 those features
> >>>     could not co-exists (and it is not part of this series).
> >>> * The NVMeTCP offload ConnectX 7 HW can support tunneling, but we
> >>>     don’t see the need for this feature yet.
> >>> * NVMe poll queues are not in the scope of this series.
> >>
> >> bonding/teaming?
> >
> > We are planning to add it as part of the "incremental feature".
> > It will follow the existing design of the mlx HW TLS.
> 
> OK.
> 
> >>> Future Work
> >>> ===========
> >>> * NVMeTCP transmit offload.
> >>> * NVMeTCP host offloads incremental features.
> >>> * NVMeTCP target offload.
> >>
> >> Which target? which host?
> >
> > Kernel nvme-tcp host driver and kernel nvme-tcp target driver.
> 
> Look, I did spend time looking into this patchset several times
> in the past, and also provided feedback. The reason why I'm questioning
> the interface now, is because I'm wandering if it can be less intrusive
> and use the common socket interface instead of inventing a new one.
 
Thanks again for your time looking into this and providing your feedback.
It’s a valid question but we think the current design is the better compromise.

