[RFC PATCH v5 01/27] nvme-tcp-offload: Add nvme-tcp-offload - NVMeTCP HW offload ULP

Shai Malin malin1024 at gmail.com
Mon May 24 13:08:36 PDT 2021


On 5/22/21 1:13 AM, Sagi Grimberg wrote:
> On 5/19/21 4:13 AM, Shai Malin wrote:
> > This patch will present the structure for the NVMeTCP offload common
> > layer driver. This module is added under "drivers/nvme/host/" and future
> > offload drivers which will register to it will be placed under
> > "drivers/nvme/hw".
> > This new driver will be enabled by the Kconfig "NVM Express over Fabrics
> > TCP offload common layer".
> > In order to support the new transport type, for host mode, no change is
> > needed.
> >
> > Each new vendor-specific offload driver will register to this ULP during
> > its probe function, by filling out the nvme_tcp_ofld_dev->ops and
> > nvme_tcp_ofld_dev->private_data and calling nvme_tcp_ofld_register_dev
> > with the initialized struct.
> >
> > The internal implementation:
> > - tcp-offload.h:
> >    Includes all common structs and ops to be used and shared by offload
> >    drivers.
> >
> > - tcp-offload.c:
> >    Includes the init function which registers as a NVMf transport just
> >    like any other transport.
> >
> > Acked-by: Igor Russkikh <irusskikh at marvell.com>
> > Signed-off-by: Dean Balandin <dbalandin at marvell.com>
> > Signed-off-by: Prabhakar Kushwaha <pkushwaha at marvell.com>
> > Signed-off-by: Omkar Kulkarni <okulkarni at marvell.com>
> > Signed-off-by: Michal Kalderon <mkalderon at marvell.com>
> > Signed-off-by: Ariel Elior <aelior at marvell.com>
> > Signed-off-by: Shai Malin <smalin at marvell.com>
> > Reviewed-by: Hannes Reinecke <hare at suse.de>
> > ---
> >   MAINTAINERS                     |   8 ++
> >   drivers/nvme/host/Kconfig       |  16 +++
> >   drivers/nvme/host/Makefile      |   3 +
> >   drivers/nvme/host/tcp-offload.c | 126 +++++++++++++++++++
> >   drivers/nvme/host/tcp-offload.h | 212 ++++++++++++++++++++++++++++++++
> >   5 files changed, 365 insertions(+)
> >   create mode 100644 drivers/nvme/host/tcp-offload.c
> >   create mode 100644 drivers/nvme/host/tcp-offload.h
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index bd7aff0c120f..49a4a73ea1c7 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -13092,6 +13092,14 @@ F:   drivers/nvme/host/
> >   F:  include/linux/nvme.h
> >   F:  include/uapi/linux/nvme_ioctl.h
> >
> > +NVM EXPRESS TCP OFFLOAD TRANSPORT DRIVERS
> > +M:   Shai Malin <smalin at marvell.com>
> > +M:   Ariel Elior <aelior at marvell.com>
> > +L:   linux-nvme at lists.infradead.org
> > +S:   Supported
> > +F:   drivers/nvme/host/tcp-offload.c
> > +F:   drivers/nvme/host/tcp-offload.h
> > +
> >   NVM EXPRESS FC TRANSPORT DRIVERS
> >   M:  James Smart <james.smart at broadcom.com>
> >   L:  linux-nvme at lists.infradead.org
> > diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
> > index a44d49d63968..6e869e94e67f 100644
> > --- a/drivers/nvme/host/Kconfig
> > +++ b/drivers/nvme/host/Kconfig
> > @@ -84,3 +84,19 @@ config NVME_TCP
> >         from https://github.com/linux-nvme/nvme-cli.
> >
> >         If unsure, say N.
> > +
> > +config NVME_TCP_OFFLOAD
> > +     tristate "NVM Express over Fabrics TCP offload common layer"
> > +     default m
> > +     depends on INET
> > +     depends on BLK_DEV_NVME
>
> This needs to be: select NVME_CORE
>
> In fact, I've sent a patch that fixes that for nvme-tcp..

Thanks, will be fixed.

>
> > +     select NVME_FABRICS
> > +     help
> > +       This provides support for the NVMe over Fabrics protocol using
> > +       the TCP offload transport. This allows you to use remote block devices
> > +       exported using the NVMe protocol set.
> > +
> > +       To configure a NVMe over Fabrics controller use the nvme-cli tool
> > +       from https://github.com/linux-nvme/nvme-cli.
> > +
> > +       If unsure, say N.
> > diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
> > index cbc509784b2e..3c3fdf83ce38 100644
> > --- a/drivers/nvme/host/Makefile
> > +++ b/drivers/nvme/host/Makefile
> > @@ -8,6 +8,7 @@ obj-$(CONFIG_NVME_FABRICS)            += nvme-fabrics.o
> >   obj-$(CONFIG_NVME_RDMA)                     += nvme-rdma.o
> >   obj-$(CONFIG_NVME_FC)                       += nvme-fc.o
> >   obj-$(CONFIG_NVME_TCP)                      += nvme-tcp.o
> > +obj-$(CONFIG_NVME_TCP_OFFLOAD)       += nvme-tcp-offload.o
> >
> >   nvme-core-y                         := core.o ioctl.o
> >   nvme-core-$(CONFIG_TRACING)         += trace.o
> > @@ -26,3 +27,5 @@ nvme-rdma-y                         += rdma.o
> >   nvme-fc-y                           += fc.o
> >
> >   nvme-tcp-y                          += tcp.o
> > +
> > +nvme-tcp-offload-y           += tcp-offload.o
> > diff --git a/drivers/nvme/host/tcp-offload.c b/drivers/nvme/host/tcp-offload.c
> > new file mode 100644
> > index 000000000000..711232eba339
> > --- /dev/null
> > +++ b/drivers/nvme/host/tcp-offload.c
> > @@ -0,0 +1,126 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright 2021 Marvell. All rights reserved.
> > + */
> > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > +/* Kernel includes */
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +
> > +/* Driver includes */
> > +#include "tcp-offload.h"
> > +
> > +static LIST_HEAD(nvme_tcp_ofld_devices);
> > +static DECLARE_RWSEM(nvme_tcp_ofld_devices_rwsem);
>
> Why is that a rwsem?

It was based on nvmf_transports_rwsem.
We will change it to a regular lock.
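For example (untested sketch), the locking in the register/unregister paths
would simply become:

	static LIST_HEAD(nvme_tcp_ofld_devices);
	static DEFINE_MUTEX(nvme_tcp_ofld_devices_mutex);

	/* in nvme_tcp_ofld_register_dev(), after the ops validation */
	mutex_lock(&nvme_tcp_ofld_devices_mutex);
	list_add_tail(&dev->entry, &nvme_tcp_ofld_devices);
	mutex_unlock(&nvme_tcp_ofld_devices_mutex);

	/* in nvme_tcp_ofld_unregister_dev() */
	mutex_lock(&nvme_tcp_ofld_devices_mutex);
	list_del(&dev->entry);
	mutex_unlock(&nvme_tcp_ofld_devices_mutex);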

>
> > +
> > +/**
> > + * nvme_tcp_ofld_register_dev() - NVMeTCP Offload Library registration
> > + * function.
> > + * @dev:     NVMeTCP offload device instance to be registered to the
> > + *           common tcp offload instance.
> > + *
> > + * API function that registers the type of vendor specific driver
> > + * being implemented to the common NVMe over TCP offload library. Part of
> > + * the overall init sequence of starting up an offload driver.
> > + */
> > +int nvme_tcp_ofld_register_dev(struct nvme_tcp_ofld_dev *dev)
> > +{
> > +     struct nvme_tcp_ofld_ops *ops = dev->ops;
> > +
> > +     if (!ops->claim_dev ||
> > +         !ops->setup_ctrl ||
> > +         !ops->release_ctrl ||
> > +         !ops->create_queue ||
> > +         !ops->drain_queue ||
> > +         !ops->destroy_queue ||
> > +         !ops->poll_queue ||
> > +         !ops->init_req ||
> > +         !ops->send_req ||
> > +         !ops->commit_rqs)
> > +             return -EINVAL;
> > +
> > +     down_write(&nvme_tcp_ofld_devices_rwsem);
> > +     list_add_tail(&dev->entry, &nvme_tcp_ofld_devices);
> > +     up_write(&nvme_tcp_ofld_devices_rwsem);
> > +
> > +     return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(nvme_tcp_ofld_register_dev);
> > +
> > +/**
> > + * nvme_tcp_ofld_unregister_dev() - NVMeTCP Offload Library unregistration
> > + * function.
> > + * @dev:     NVMeTCP offload device instance to be unregistered from the
> > + *           common tcp offload instance.
> > + *
> > + * API function that unregisters the type of vendor specific driver being
> > + * implemented from the common NVMe over TCP offload library.
> > + * Part of the overall exit sequence of unloading the implemented driver.
> > + */
> > +void nvme_tcp_ofld_unregister_dev(struct nvme_tcp_ofld_dev *dev)
> > +{
> > +     down_write(&nvme_tcp_ofld_devices_rwsem);
> > +     list_del(&dev->entry);
> > +     up_write(&nvme_tcp_ofld_devices_rwsem);
> > +}
> > +EXPORT_SYMBOL_GPL(nvme_tcp_ofld_unregister_dev);
> > +
> > +/**
> > + * nvme_tcp_ofld_report_queue_err() - NVMeTCP Offload report error event
> > + * callback function. Pointed to by nvme_tcp_ofld_queue->report_err.
> > + * @queue:   NVMeTCP offload queue instance on which the error has occurred.
> > + *
> > + * API function that allows the vendor specific offload driver to reports errors
> > + * to the common offload layer, to invoke error recovery.
> > + */
> > +int nvme_tcp_ofld_report_queue_err(struct nvme_tcp_ofld_queue *queue)
> > +{
>
> No semantics into what was the error?

We were following nvme_tcp_error_recovery() in tcp.c and
nvme_rdma_error_recovery() in rdma.c, where the controller reset flow is
invoked regardless of the error type. We were planning to do something
similar, so the error type doesn't need to be part of the interface.
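The placeholder will end up roughly like this (untested sketch, reusing the
err_work already defined in struct nvme_tcp_ofld_ctrl):

	int nvme_tcp_ofld_report_queue_err(struct nvme_tcp_ofld_queue *queue)
	{
		struct nvme_ctrl *nctrl = &queue->ctrl->nctrl;

		if (!nvme_change_ctrl_state(nctrl, NVME_CTRL_RESETTING))
			return -EBUSY;

		queue_work(nvme_wq, &queue->ctrl->err_work);

		return 0;
	}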

>
> > +     /* Placeholder - invoke error recovery flow */
> > +
> > +     return 0;
> > +}
> > +
> > +/**
> > + * nvme_tcp_ofld_req_done() - NVMeTCP Offload request done callback
> > + * function. Pointed to by nvme_tcp_ofld_req->done.
> > + * Handles both NVME_TCP_F_DATA_SUCCESS flag and NVMe CQ.
> > + * @req:     NVMeTCP offload request to complete.
> > + * @result:     The nvme_result.
> > + * @status:     The completion status.
> > + *
> > + * API function that allows the vendor specific offload driver to report request
> > + * completions to the common offload layer.
> > + */
> > +void nvme_tcp_ofld_req_done(struct nvme_tcp_ofld_req *req,
> > +                         union nvme_result *result,
> > +                         __le16 status)
> > +{
>
> Why do you need to pass back the result? Isn't that a part
> of the request?

The result is part of the request only after calling nvme_try_complete_req().
Both nvme_try_complete_req() and nvme_complete_rq() should be called
from the ULP layer.
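For example, assuming the offload request is the blk-mq PDU of its struct
request (untested sketch), the done callback will be roughly:

	void nvme_tcp_ofld_req_done(struct nvme_tcp_ofld_req *req,
				    union nvme_result *result,
				    __le16 status)
	{
		struct request *rq = blk_mq_rq_from_pdu(req);

		if (!nvme_try_complete_req(rq, status, *result))
			nvme_complete_rq(rq);
	}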

>
> > +     /* Placeholder - complete request with/without error */
> > +}
> > +
> > +static struct nvmf_transport_ops nvme_tcp_ofld_transport = {
> > +     .name           = "tcp_offload",
> > +     .module         = THIS_MODULE,
> > +     .required_opts  = NVMF_OPT_TRADDR,
> > +     .allowed_opts   = NVMF_OPT_TRSVCID | NVMF_OPT_NR_WRITE_QUEUES  |
> > +                       NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
> > +                       NVMF_OPT_RECONNECT_DELAY | NVMF_OPT_HDR_DIGEST |
> > +                       NVMF_OPT_DATA_DIGEST | NVMF_OPT_NR_POLL_QUEUES |
> > +                       NVMF_OPT_TOS,
> > +};
> > +
> > +static int __init nvme_tcp_ofld_init_module(void)
> > +{
> > +     nvmf_register_transport(&nvme_tcp_ofld_transport);
> > +
> > +     return 0;
> > +}
> > +
> > +static void __exit nvme_tcp_ofld_cleanup_module(void)
> > +{
> > +     nvmf_unregister_transport(&nvme_tcp_ofld_transport);
> > +}
> > +
> > +module_init(nvme_tcp_ofld_init_module);
> > +module_exit(nvme_tcp_ofld_cleanup_module);
> > +MODULE_LICENSE("GPL v2");
> > diff --git a/drivers/nvme/host/tcp-offload.h b/drivers/nvme/host/tcp-offload.h
> > new file mode 100644
> > index 000000000000..949132ce2ed4
> > --- /dev/null
> > +++ b/drivers/nvme/host/tcp-offload.h
> > @@ -0,0 +1,212 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Copyright 2021 Marvell. All rights reserved.
> > + */
> > +
> > +/* Linux includes */
> > +#include <linux/dma-mapping.h>
> > +#include <linux/scatterlist.h>
> > +#include <linux/types.h>
> > +#include <linux/nvme-tcp.h>
> > +
> > +/* Driver includes */
> > +#include "nvme.h"
> > +#include "fabrics.h"
> > +
> > +/* Forward declarations */
> > +struct nvme_tcp_ofld_ops;
> > +
> > +/* Representation of a vendor-specific device. This is the struct used to
> > + * register to the offload layer by the vendor-specific driver during its probe
> > + * function.
> > + * Allocated by vendor-specific driver.
> > + */
> > +struct nvme_tcp_ofld_dev {
> > +     struct list_head entry;
> > +     struct net_device *ndev;
> > +     struct nvme_tcp_ofld_ops *ops;
> > +
> > +     /* Vendor specific driver context */
> > +     void *private_data;
>
> Usually there is private_data pointer when the upper layer
> passes a struct to the lower layer to initialize, that is
> not the case here, why don't the device driver just use
> container_of to access its internal representation?

Thanks, will be fixed.
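The vendor driver will embed the offload device and use container_of(),
something like this (the qedn_ctx naming below is just illustrative):

	struct qedn_ctx {
		struct nvme_tcp_ofld_dev ofld_dev;
		/* vendor-specific fields */
	};

	static inline struct qedn_ctx *to_qedn(struct nvme_tcp_ofld_dev *dev)
	{
		return container_of(dev, struct qedn_ctx, ofld_dev);
	}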

>
> > +     int num_hw_vectors;
> > +};
> > +
> > +/* Per IO struct holding the nvme_request and command
> > + * Allocated by blk-mq.
> > + */
> > +struct nvme_tcp_ofld_req {
> > +     struct nvme_request req;
> > +     struct nvme_command nvme_cmd;
> > +     struct list_head queue_entry;
> > +     struct nvme_tcp_ofld_queue *queue;
> > +     struct request *rq;
>
> Why is there an explicit rq pointer? there are converters from
> rq to driver req struct and vice-versa.

Will be fixed.

>
> > +
> > +     /* Vendor specific driver context */
> > +     void *private_data;
> > +
> > +     bool async;
> > +     bool last;
>
> Undocumented and unclear why these are needed.

Regarding 'async', we need it so the vendor driver can use the common
send_req() for both the regular IO flow and the async event flow. We will
add a comment to explain it.
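The intended usage is roughly the following (untested sketch; the async_req
field is illustrative and not part of this patch):

	static void nvme_tcp_ofld_submit_async_event(struct nvme_ctrl *arg)
	{
		struct nvme_tcp_ofld_ctrl *ctrl =
			container_of(arg, struct nvme_tcp_ofld_ctrl, nctrl);
		struct nvme_tcp_ofld_req *req = &ctrl->async_req;

		req->nvme_cmd.common.opcode = nvme_admin_async_event;
		req->nvme_cmd.common.command_id = NVME_AQ_BLK_MQ_DEPTH;
		req->async = true;	/* no struct request behind this command */

		ctrl->dev->ops->send_req(req);
	}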

We will remove 'last'.

>
> > +
> > +     void (*done)(struct nvme_tcp_ofld_req *req,
> > +                  union nvme_result *result,
> > +                  __le16 status);
> > +};
> > +
> > +enum nvme_tcp_ofld_queue_flags {
> > +     NVME_TCP_OFLD_Q_ALLOCATED = 0,
> > +     NVME_TCP_OFLD_Q_LIVE = 1,
> > +};
> > +
> > +/* Allocated by nvme_tcp_ofld */
> > +struct nvme_tcp_ofld_queue {
> > +     /* Offload device associated to this queue */
> > +     struct nvme_tcp_ofld_dev *dev;
> > +     struct nvme_tcp_ofld_ctrl *ctrl;
> > +     unsigned long flags;
> > +     size_t cmnd_capsule_len;
> > +
> > +     u8 hdr_digest;
> > +     u8 data_digest;
> > +     u8 tos;
> > +
> > +     /* Vendor specific driver context */
> > +     void *private_data;
> > +
> > +     /* Error callback function */
> > +     int (*report_err)(struct nvme_tcp_ofld_queue *queue);
> > +};
> > +
> > +/* Connectivity (routing) params used for establishing a connection */
> > +struct nvme_tcp_ofld_ctrl_con_params {
> > +     /* Input params */
> > +     struct sockaddr_storage remote_ip_addr;
> > +
> > +     /* If NVMF_OPT_HOST_TRADDR is provided it will be set in local_ip_addr
> > +      * in nvme_tcp_ofld_create_ctrl().
> > +      * If NVMF_OPT_HOST_TRADDR is not provided the local_ip_addr will be
> > +      * initialized by claim_dev().
> > +      */
> > +     struct sockaddr_storage local_ip_addr;
> > +
> > +     /* Output params */
> > +     struct sockaddr remote_mac_addr;
> > +     struct sockaddr local_mac_addr;
> > +     u16 vlan_id;
>
> Why should a ULP care about this? It's a red-flag to
> me that a tcp ulp needs these params.

Right, the TCP ULP doesn't need these output params.
We will remove them.

>
> > +};
> > +
> > +/* Allocated by nvme_tcp_ofld */
> > +struct nvme_tcp_ofld_ctrl {
> > +     struct nvme_ctrl nctrl;
> > +     struct list_head list;
> > +     struct nvme_tcp_ofld_dev *dev;
> > +
> > +     /* admin and IO queues */
> > +     struct blk_mq_tag_set tag_set;
> > +     struct blk_mq_tag_set admin_tag_set;
> > +     struct nvme_tcp_ofld_queue *queues;
> > +
> > +     struct work_struct err_work;
> > +     struct delayed_work connect_work;
> > +
> > +     /*
> > +      * Each entry in the array indicates the number of queues of
> > +      * corresponding type.
> > +      */
> > +     u32 io_queues[HCTX_MAX_TYPES];
>
> What if the offload device doesn't support poll queue map?

nvme-tcp-offload should support any queue type.
Each vendor driver registers with its vendor-specific ops and is allowed
not to support poll queues. In that case, io_queues[] will include only
the supported queue types.
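For example (untested sketch, assuming poll_queue becomes optional in the ops
rather than mandatory in nvme_tcp_ofld_register_dev()):

	static void nvme_tcp_ofld_set_io_queues(struct nvme_tcp_ofld_ctrl *ctrl,
						unsigned int nr_io_queues)
	{
		struct nvmf_ctrl_options *opts = ctrl->nctrl.opts;

		/* shared read/write queues */
		ctrl->io_queues[HCTX_TYPE_DEFAULT] =
			min(opts->nr_io_queues, nr_io_queues);
		nr_io_queues -= ctrl->io_queues[HCTX_TYPE_DEFAULT];

		/* dedicated poll queues only if the vendor driver supports them */
		if (opts->nr_poll_queues && ctrl->dev->ops->poll_queue)
			ctrl->io_queues[HCTX_TYPE_POLL] =
				min(opts->nr_poll_queues, nr_io_queues);
	}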

>
> > +
> > +     /* Connectivity params */
> > +     struct nvme_tcp_ofld_ctrl_con_params conn_params;
> > +
> > +     /* Vendor specific driver context */
> > +     void *private_data;
> > +};
> > +
> > +struct nvme_tcp_ofld_ops {
> > +     const char *name;
> > +     struct module *module;
> > +
> > +     /* For vendor-specific driver to report what opts it supports */
> > +     int required_opts; /* bitmap using enum nvmf_parsing_opts */
> > +     int allowed_opts; /* bitmap using enum nvmf_parsing_opts */
>
> What is the difference between this one and the ulp one?

This one describes the specific vendor driver (and may differ between vendor
drivers), while the nvme-tcp-offload transport ops support any opts.
It is used in patch 4 – "nvme-tcp-offload: Add controller level
implementation" – by nvme_tcp_ofld_check_dev_opts() in tcp-offload.c.
We will improve the documentation.
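The check itself will be something along these lines (sketch only, the actual
helper is added in patch 4):

	static int nvme_tcp_ofld_check_dev_opts(struct nvmf_ctrl_options *opts,
						struct nvme_tcp_ofld_ops *ofld_ops)
	{
		unsigned int dev_opts = ofld_ops->allowed_opts |
					ofld_ops->required_opts;

		if ((opts->mask & dev_opts) != opts->mask) {
			pr_err("one or more nvmf opts are not supported by %s\n",
			       ofld_ops->name);
			return -EINVAL;
		}

		return 0;
	}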

>
> > +
> > +     /* For vendor-specific max num of segments and IO sizes */
> > +     u32 max_hw_sectors;
> > +     u32 max_segments;
>
> Understand max_segments maybe needed, but why max_hw_sectors? Is
> that something an offload device really cares about?

Yes. Offload devices might also have a max_hw_sectors limitation.

>
> > +
> > +     /**
> > +      * claim_dev: Return True if addr is reachable via offload device.
> > +      * @dev: The offload device to check.
> > +      * @conn_params: ptr to routing params to be filled by the lower
> > +      *               driver. Input+Output argument.
> > +      */
> > +     int (*claim_dev)(struct nvme_tcp_ofld_dev *dev,
> > +                      struct nvme_tcp_ofld_ctrl_con_params *conn_params);
> > +
> > +     /**
> > +      * setup_ctrl: Setup device specific controller structures.
> > +      * @ctrl: The offload ctrl.
> > +      * @new: is new setup.
> > +      */
> > +     int (*setup_ctrl)(struct nvme_tcp_ofld_ctrl *ctrl, bool new);
>
> I think that the 'new' is really an odd interface, it's ok if its
> internal to a driver, but not for an interface...

We will remove the 'new' argument and instead check ctrl->private_data.
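For example (untested sketch; struct qedn_ctrl below is just illustrative):

	static int qedn_setup_ctrl(struct nvme_tcp_ofld_ctrl *ctrl)
	{
		struct qedn_ctrl *qctrl = ctrl->private_data;

		if (!qctrl) {
			/* first setup - allocate vendor-specific ctrl context */
			qctrl = kzalloc(sizeof(*qctrl), GFP_KERNEL);
			if (!qctrl)
				return -ENOMEM;
			ctrl->private_data = qctrl;
		}

		/* re-setup path (e.g. after reset/reconnect) continues here */
		return 0;
	}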

>
> > +
> > +     /**
> > +      * release_ctrl: Release/Free device specific controller structures.
> > +      * @ctrl: The offload ctrl.
> > +      */
> > +     int (*release_ctrl)(struct nvme_tcp_ofld_ctrl *ctrl);
> > +
> > +     /**
> > +      * create_queue: Create offload queue and establish TCP + NVMeTCP
> > +      * (icreq+icresp) connection. Return true on successful connection.
> > +      * Based on nvme_tcp_alloc_queue.
> > +      * @queue: The queue itself - used as input and output.
> > +      * @qid: The queue ID associated with the requested queue.
> > +      * @q_size: The queue depth.
> > +      */
> > +     int (*create_queue)(struct nvme_tcp_ofld_queue *queue, int qid,
> > +                         size_t q_size);
>
> queue_size

Will be fixed.

>
> > +
> > +     /**
> > +      * drain_queue: Drain a given queue - Returning from this function
> > +      * ensures that no additional completions will arrive on this queue.
> > +      * @queue: The queue to drain.
> > +      */
> > +     void (*drain_queue)(struct nvme_tcp_ofld_queue *queue);
>
> I'm assuming this is a blocking call? should probably document it.

It could be a blocking call, we will document it.

>
> > +
> > +     /**
> > +      * destroy_queue: Close the TCP + NVMeTCP connection of a given queue
> > +      * and make sure its no longer active (no completions will arrive on the
> > +      * queue).
> > +      * @queue: The queue to destroy.
> > +      */
> > +     void (*destroy_queue)(struct nvme_tcp_ofld_queue *queue);
> > +
> > +     /**
> > +      * poll_queue: Poll a given queue for completions.
> > +      * @queue: The queue to poll.
> > +      */
> > +     int (*poll_queue)(struct nvme_tcp_ofld_queue *queue);
> > +
> > +     /**
> > +      * init_req: Initialize vendor-specific params for a new request.
> > +      * @req: Ptr to request to be initialized. Input+Output argument.
> > +      */
> > +     int (*init_req)(struct nvme_tcp_ofld_req *req);
>
> What is this used for?

It was added for vendor drivers that need to initialize per-request
resources.
It's not in use by the qedn driver – we will remove it.

>
> > +
> > +     /**
> > +      * send_req: Dispatch a request. Returns the execution status.
> > +      * @req: Ptr to request to be sent.
> > +      */
> > +     int (*send_req)(struct nvme_tcp_ofld_req *req);
> > +
> > +     /**
> > +      * commit_rqs: Serves the purpose of kicking the hardware in case of
> > +      * errors, otherwise it would have been kicked by the last request.
> > +      * @queue: The queue to drain.
> > +      */
> > +     void (*commit_rqs)(struct nvme_tcp_ofld_queue *queue);
>
> Is this something you actually use? And the documentation talks about
> errors which doesn't match the name, not sure what is the purpose of
> this at all.

This is for vendor drivers which might need it.
It's not in use by the qedn driver – we will remove it.


