[PATCHv5 1/1] nvme-tcp: Add option to set the physical interface to be used when connecting over TCP sockets.
Belanger, Martin
Martin.Belanger at dell.com
Mon May 17 10:58:38 PDT 2021
> On 5/14/21 4:46 AM, Martin Belanger wrote:
> > From: Martin Belanger <martin.belanger at dell.com>
> >
> > Addressed reviews from PATCHv4.
> >
> > In our application, we need a way to force TCP connections to go out a
> > specific IP interface instead of letting Linux select the interface
> > based on the routing tables. This patch adds the option 'host-iface'
> > to allow specifying the interface to use. Note that corresponding
> > changes to the nvme-cli utility will follow.
> >
> > When the option host-iface is specified, the driver uses the specified
> > interface to set the option SO_BINDTODEVICE on the TCP socket before
> > connecting.
> >
> > This new option is needed in addtion to the existing host-traddr for
> > the following reasons:
> >
> > Specifying an IP interface by its associated IP address is less
> > intuitive than specifying the actual interface name and, in some
> > cases, simply doesn't work. That's because the association between
> > interfaces and IP addresses is not predictable. IP addresses can be
> > changed or can change by themselves over time (e.g. DHCP). Interface
> > names are predictable [1] and will persist over time. Consider the
> > following configuration.
> >
> > 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state ...
> > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> > inet 100.0.0.100/24 scope global lo
> > valid_lft forever preferred_lft forever
> > 2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
> > link/ether 08:00:27:21:65:ec brd ff:ff:ff:ff:ff:ff
> > inet 100.0.0.100/24 scope global enp0s3
> > valid_lft forever preferred_lft forever
> > 3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
> > link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
> > inet 100.0.0.100/24 scope global enp0s8
> > valid_lft forever preferred_lft forever
> >
> > The above is a VM that I configured with the same IP address
> > (100.0.0.100) on all interfaces. Doing a reverse lookup to identify
> > the unique interface associated with 100.0.0.100 does not work here.
> > And this is why the option host_iface is required. I understand that
> > the above config does not represent a standard host system, but I'm
> > using this to prove a point: "We can never know how users will
> > configure their systems". By te way, The above configuration is
> > perfectly fine by Linux.
> >
> > The current TCP implementation for host_traddr performs a
> > bind()-before-connect(). This is a common construct to set the source
> > IP address on a TCP socket before connecting. This has no effect on
> > how Linux selects the interface for the connection. That's because
> > Linux uses the Weak End System model as described in RFC1122 [2]. On
> > the other hand, setting the Source IP Address has benefits and should
> > be supported by linux-nvme. In fact, setting the Source IP Address is
> > a mandatory FedGov requirement (e.g. connection to a RADIUS/TACACS+
> server).
> > Consider the following configuration.
> >
> > $ ip addr list dev enp0s8
> > 3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
> > link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
> > inet 192.168.56.101/24 brd 192.168.56.255 scope global enp0s8
> > valid_lft 426sec preferred_lft 426sec
> > inet 192.168.56.102/24 scope global secondary enp0s8
> > valid_lft forever preferred_lft forever
> > inet 192.168.56.103/24 scope global secondary enp0s8
> > valid_lft forever preferred_lft forever
> > inet 192.168.56.104/24 scope global secondary enp0s8
> > valid_lft forever preferred_lft forever
> >
> > Here we can see that several addresses are associated with interface
> > enp0s8. By default, Linux always selects the default IP address,
> > 192.168.56.101, as the source address when connecting over interface
> > enp0s8. Some users, however, want the ability to specify a different
> > source address (e.g., 192.168.56.102, 192.168.56.103, ...). The option
> > host_traddr can be used as-is to perform this function.
> >
> > In conclusion, I believe that we need 2 options for TCP connections.
> > One that can be used to specify an interface (host-iface). And one
> > that can be used to set the source address (host-traddr). Users should
> > be allowed to use one or the other, or both, or none. Of course, the
> > documentation for host_traddr will need some clarification. It should
> > state that when used for TCP connection, this option only sets the
> > source address. And the documentation for host_iface should say that
> > this option is only available for TCP connections.
> >
> > References:
> > [1]
> > https://urldefense.com/v3/__https://www.freedesktop.org/wiki/Software/
> >
> systemd/*5C__;JQ!!LpKI!y1LboKDzdi3h9GSBpPMjDitsAgdZNF79yQRHNTyIBsL4
> pPh
> > Hd2KRhYktttSLH1g3dg$ [freedesktop[.]org]
> > PredictableNetworkInterfaceNames/ [2]
> > https://urldefense.com/v3/__https://tools.ietf.org/html/rfc1122__;!!Lp
> >
> KI!y1LboKDzdi3h9GSBpPMjDitsAgdZNF79yQRHNTyIBsL4pPhHd2KRhYktttTDLg
> HT0w$
> > [tools[.]ietf[.]org]
> >
> > Tested both IPv4 and IPv6 connections.
> >
> > Signed-off-by: Martin Belanger <martin.belanger at dell.com>
> > ---
> > drivers/nvme/host/core.c | 5 ++++
> > drivers/nvme/host/fabrics.c | 14 +++++++++++
> > drivers/nvme/host/fabrics.h | 6 ++++-
> > drivers/nvme/host/tcp.c | 48
> ++++++++++++++++++++++++++++++++++++-
> > 4 files changed, 71 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index
> > 762125f2905f..f98255eb7deb 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -4067,6 +4067,11 @@ static int nvme_class_uevent(struct device
> > *dev, struct kobj_uevent_env *env)
> >
> > ret = add_uevent_var(env, "NVME_HOST_TRADDR=%s",
> > opts->host_traddr ?: "none");
> > + if (ret)
> > + return ret;
> > +
> > + ret = add_uevent_var(env, "NVME_HOST_IFACE=%s",
> > + opts->host_iface ?: "none");
> > }
> > return ret;
> > }
> > diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
> > index a2bb7fc63a73..76dc3eaf46f3 100644
> > --- a/drivers/nvme/host/fabrics.c
> > +++ b/drivers/nvme/host/fabrics.c
> > @@ -112,6 +112,9 @@ int nvmf_get_address(struct nvme_ctrl *ctrl, char
> *buf, int size)
> > if (ctrl->opts->mask & NVMF_OPT_HOST_TRADDR)
> > len += scnprintf(buf + len, size - len, "%shost_traddr=%s",
> > (len) ? "," : "", ctrl->opts->host_traddr);
> > + if (ctrl->opts->mask & NVMF_OPT_HOST_IFACE)
> > + len += scnprintf(buf + len, size - len, "%shost_iface=%s",
> > + (len) ? "," : "", ctrl->opts->host_iface);
> > len += scnprintf(buf + len, size - len, "\n");
> >
> > return len;
> > @@ -545,6 +548,7 @@ static const match_table_t opt_tokens = {
> > { NVMF_OPT_KATO, "keep_alive_tmo=%d" },
> > { NVMF_OPT_HOSTNQN, "hostnqn=%s" },
> > { NVMF_OPT_HOST_TRADDR, "host_traddr=%s" },
> > + { NVMF_OPT_HOST_IFACE, "host_iface=%s"
> },
> > { NVMF_OPT_HOST_ID, "hostid=%s" },
> > { NVMF_OPT_DUP_CONNECT, "duplicate_connect" },
> > { NVMF_OPT_DISABLE_SQFLOW, "disable_sqflow" },
> > @@ -754,6 +758,15 @@ static int nvmf_parse_options(struct
> nvmf_ctrl_options *opts,
> > kfree(opts->host_traddr);
> > opts->host_traddr = p;
> > break;
> > + case NVMF_OPT_HOST_IFACE:
> > + p = match_strdup(args);
> > + if (!p) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > + kfree(opts->host_iface);
> > + opts->host_iface = p;
> > + break;
> > case NVMF_OPT_HOST_ID:
> > p = match_strdup(args);
> > if (!p) {
> > @@ -938,6 +951,7 @@ void nvmf_free_options(struct nvmf_ctrl_options
> *opts)
> > kfree(opts->trsvcid);
> > kfree(opts->subsysnqn);
> > kfree(opts->host_traddr);
> > + kfree(opts->host_iface);
> > kfree(opts);
> > }
> > EXPORT_SYMBOL_GPL(nvmf_free_options);
> > diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
> > index d7f7974dc208..c31dad69a773 100644
> > --- a/drivers/nvme/host/fabrics.h
> > +++ b/drivers/nvme/host/fabrics.h
> > @@ -66,6 +66,7 @@ enum {
> > NVMF_OPT_NR_POLL_QUEUES = 1 << 18,
> > NVMF_OPT_TOS = 1 << 19,
> > NVMF_OPT_FAIL_FAST_TMO = 1 << 20,
> > + NVMF_OPT_HOST_IFACE = 1 << 21,
> > };
> >
> > /**
> > @@ -83,7 +84,9 @@ enum {
> > * @trsvcid: The transport-specific TRSVCID field for a port on the
> > * subsystem which is adding a controller.
> > * @host_traddr: A transport-specific field identifying the NVME host port
> > - * to use for the connection to the controller.
> > + * to use for the connection to the controller.
> > + * @host_iface: A transport-specific field identifying the NVME host
> > + * interface to use for the connection to the controller.
> > * @queue_size: Number of IO queue elements.
> > * @nr_io_queues: Number of controller IO queues that will be
> established.
> > * @reconnect_delay: Time between two consecutive reconnect attempts.
> > @@ -108,6 +111,7 @@ struct nvmf_ctrl_options {
> > char *traddr;
> > char *trsvcid;
> > char *host_traddr;
> > + char *host_iface;
> > size_t queue_size;
> > unsigned int nr_io_queues;
> > unsigned int reconnect_delay;
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index
> > 0222e23f5936..32e44527afaa 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -123,6 +123,7 @@ struct nvme_tcp_ctrl {
> > struct blk_mq_tag_set admin_tag_set;
> > struct sockaddr_storage addr;
> > struct sockaddr_storage src_addr;
> > + struct net_device *ndev;
> > struct nvme_ctrl ctrl;
> >
> > struct work_struct err_work;
> > @@ -1454,6 +1455,20 @@ static int nvme_tcp_alloc_queue(struct
> nvme_ctrl *nctrl,
> > }
> > }
> >
> > + if (nctrl->opts->mask & NVMF_OPT_HOST_IFACE) {
> > + char *iface = nctrl->opts->host_iface;
> > + sockptr_t optval = KERNEL_SOCKPTR(iface);
> > +
> > + ret = sock_setsockopt(queue->sock, SOL_SOCKET,
> SO_BINDTODEVICE,
> > + optval, strlen(iface));
> > + if (ret) {
> > + dev_err(nctrl->device,
> > + "failed to bind to interface %s queue %d err %d\n",
> > + iface, qid, ret);
> > + goto err_sock;
> > + }
> > + }
> > +
> > queue->hdr_digest = nctrl->opts->hdr_digest;
> > queue->data_digest = nctrl->opts->data_digest;
> > if (queue->hdr_digest || queue->data_digest) { @@ -2514,6 +2529,37
> > @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
> > }
> > }
> >
> > + if (opts->mask & NVMF_OPT_HOST_IFACE) {
> > + ctrl->ndev = dev_get_by_name(&init_net, opts->host_iface);
> > + if (!ctrl->ndev) {
> > + pr_err("invalid interface passed: %s\n",
> > + opts->host_iface);
> > + ret = -ENODEV;
> > + goto out_free_ctrl;
> > + }
> > + }
> > +
> > + /* For IPv6 HOST_TRADDR, check if scope ID is specified. If not, set
> > + * HOST_TRADDR's scope ID to the scope ID of HOST_IFACE or the
> scope
> > + * ID of TRADDR
> > + */
>
> Not sure I follow what this is doing. the binding is taken from either
> host_traddr, host_iface or traddr? What are the rules here?
Hi Sagi,
They do something similar in ping, but I don’t think it's necessary.
I will remove this code and re-submit the patch.
Martin
>
> > + if ((opts->mask & NVMF_OPT_HOST_TRADDR) &&
> > + (ctrl->src_addr.ss_family == AF_INET6)) {
> > + struct sockaddr_in6 *src_addr6 =
> > + (struct sockaddr_in6 *)&ctrl->src_addr;
> > +
> > + if (src_addr6->sin6_scope_id == 0) {
> > + if (ctrl->ndev && (ctrl->ndev->ifindex != 0)) {
> > + src_addr6->sin6_scope_id = ctrl->ndev-
> >ifindex;
> > + } else {
> > + struct sockaddr_in6 *addr6 =
> > + (struct sockaddr_in6 *)&ctrl->addr;
> > +
> > + src_addr6->sin6_scope_id = addr6-
> >sin6_scope_id;
> > + }
> > + }
> > + }
> > +
> > if (!opts->duplicate_connect && nvme_tcp_existing_controller(opts))
> {
> > ret = -EALREADY;
> > goto out_free_ctrl;
> > @@ -2570,7 +2616,7 @@ static struct nvmf_transport_ops
> nvme_tcp_transport = {
> > NVMF_OPT_HOST_TRADDR |
> NVMF_OPT_CTRL_LOSS_TMO |
> > NVMF_OPT_HDR_DIGEST |
> NVMF_OPT_DATA_DIGEST |
> > NVMF_OPT_NR_WRITE_QUEUES |
> NVMF_OPT_NR_POLL_QUEUES |
> > - NVMF_OPT_TOS,
> > + NVMF_OPT_TOS | NVMF_OPT_HOST_IFACE,
> > .create_ctrl = nvme_tcp_create_ctrl,
> > };
> >
> > --
> > 2.30.2
> >
More information about the Linux-nvme
mailing list