[PATCHv4 1/1] nvme-tcp: Add option to set the physical interface to be used when connecting over TCP sockets.

Hannes Reinecke hare at suse.de
Fri May 14 02:19:05 PDT 2021


On 5/13/21 9:18 PM, Martin Belanger wrote:
> From: Martin Belanger <martin.belanger at dell.com>
> 
> In our application, we need a way to force TCP connections to go out a
> specific IP interface instead of letting Linux select the interface
> based on the routing tables. This patch adds the option 'host-iface'
> to allow specifying the interface to use. Note that corresponding
> changes to the nvme-cli utility will follow.
> 
> When the option host-iface is specified, the driver uses the specified
> interface to set the option SO_BINDTODEVICE on the TCP socket before
> connecting.
> 
> This new option is needed in addtion to the existing host-traddr for
> the following reasons:
> 
> Specifying an IP interface by its associated IP address is less
> intuitive than specifying the actual interface name and, in some cases,
> simply doesn't work. That's because the association between interfaces
> and IP addresses is not predictable. IP addresses can be changed or can
> change by themselves over time (e.g. DHCP). Interface names are
> predictable [1] and will persist over time. Consider the following
> configuration.
> 
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state ...
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>     inet 100.0.0.100/24 scope global lo
>        valid_lft forever preferred_lft forever
> 2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
>     link/ether 08:00:27:21:65:ec brd ff:ff:ff:ff:ff:ff
>     inet 100.0.0.100/24 scope global enp0s3
>        valid_lft forever preferred_lft forever
> 3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
>     link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
>     inet 100.0.0.100/24 scope global enp0s8
>        valid_lft forever preferred_lft forever
> 
> The above is a VM that I configured with the same IP address
> (100.0.0.100) on all interfaces. Doing a reverse lookup to identify the
> unique interface associated with 100.0.0.100 does not work here. And
> this is why the option host_iface is required. I understand that the
> above config does not represent a standard host system, but I'm using
> this to prove a point: "We can never know how users will configure
> their systems". By te way, The above configuration is perfectly fine
> by Linux.
> 
> The current TCP implementation for host_traddr performs a
> bind()-before-connect(). This is a common construct to set the source
> IP address on a TCP socket before connecting. This has no effect on how
> Linux selects the interface for the connection. That's because Linux
> uses the Weak End System model as described in RFC1122 [2]. On the other
> hand, setting the Source IP Address has benefits and should be supported
> by linux-nvme. In fact, setting the Source IP Address is a mandatory
> FedGov requirement (e.g. connection to a RADIUS/TACACS+ server).
> Consider the following configuration.
> 
> $ ip addr list dev enp0s8
> 3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
>     link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
>     inet 192.168.56.101/24 brd 192.168.56.255 scope global enp0s8
>        valid_lft 426sec preferred_lft 426sec
>     inet 192.168.56.102/24 scope global secondary enp0s8
>        valid_lft forever preferred_lft forever
>     inet 192.168.56.103/24 scope global secondary enp0s8
>        valid_lft forever preferred_lft forever
>     inet 192.168.56.104/24 scope global secondary enp0s8
>        valid_lft forever preferred_lft forever
> 
> Here we can see that several addresses are associated with interface
> enp0s8. By default, Linux always selects the default IP address,
> 192.168.56.101, as the source address when connecting over interface
> enp0s8. Some users, however, want the ability to specify a different
> source address (e.g., 192.168.56.102, 192.168.56.103, ...). The option
> host_traddr can be used as-is to perform this function.
> 
> In conclusion, I believe that we need 2 options for TCP connections.
> One that can be used to specify an interface (host-iface). And one that
> can be used to set the source address (host-traddr). Users should be
> allowed to use one or the other, or both, or none. Of course, the
> documentation for host_traddr will need some clarification. It should
> state that when used for TCP connection, this option only sets the
> source address. And the documentation for host_iface should say that
> this option is only available for TCP connections.
> 
> References:
> [1] https://www.freedesktop.org/wiki/Software/systemd/\
> PredictableNetworkInterfaceNames/
> [2] https://tools.ietf.org/html/rfc1122
> 
> Tested both IPv4 and IPv6 connections.
> 
> Signed-off-by: Martin Belanger <martin.belanger at dell.com>
> ---
>  drivers/nvme/host/core.c    |  5 ++++
>  drivers/nvme/host/fabrics.c | 14 ++++++++++
>  drivers/nvme/host/fabrics.h |  6 ++++-
>  drivers/nvme/host/tcp.c     | 54 ++++++++++++++++++++++++++++++++++---
>  4 files changed, 75 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 762125f2905f..f98255eb7deb 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4067,6 +4067,11 @@ static int nvme_class_uevent(struct device *dev, struct kobj_uevent_env *env)
> 
>  		ret = add_uevent_var(env, "NVME_HOST_TRADDR=%s",
>  				opts->host_traddr ?: "none");
> +		if (ret)
> +			return ret;
> +
> +		ret = add_uevent_var(env, "NVME_HOST_IFACE=%s",
> +				opts->host_iface ?: "none");
>  	}
>  	return ret;
>  }
> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
> index a2bb7fc63a73..76dc3eaf46f3 100644
> --- a/drivers/nvme/host/fabrics.c
> +++ b/drivers/nvme/host/fabrics.c
> @@ -112,6 +112,9 @@ int nvmf_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
>  	if (ctrl->opts->mask & NVMF_OPT_HOST_TRADDR)
>  		len += scnprintf(buf + len, size - len, "%shost_traddr=%s",
>  				(len) ? "," : "", ctrl->opts->host_traddr);
> +	if (ctrl->opts->mask & NVMF_OPT_HOST_IFACE)
> +		len += scnprintf(buf + len, size - len, "%shost_iface=%s",
> +				(len) ? "," : "", ctrl->opts->host_iface);
>  	len += scnprintf(buf + len, size - len, "\n");
> 
>  	return len;
> @@ -545,6 +548,7 @@ static const match_table_t opt_tokens = {
>  	{ NVMF_OPT_KATO,		"keep_alive_tmo=%d"	},
>  	{ NVMF_OPT_HOSTNQN,		"hostnqn=%s"		},
>  	{ NVMF_OPT_HOST_TRADDR,		"host_traddr=%s"	},
> +	{ NVMF_OPT_HOST_IFACE,		"host_iface=%s"		},
>  	{ NVMF_OPT_HOST_ID,		"hostid=%s"		},
>  	{ NVMF_OPT_DUP_CONNECT,		"duplicate_connect"	},
>  	{ NVMF_OPT_DISABLE_SQFLOW,	"disable_sqflow"	},
> @@ -754,6 +758,15 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
>  			kfree(opts->host_traddr);
>  			opts->host_traddr = p;
>  			break;
> +		case NVMF_OPT_HOST_IFACE:
> +			p = match_strdup(args);
> +			if (!p) {
> +				ret = -ENOMEM;
> +				goto out;
> +			}
> +			kfree(opts->host_iface);
> +			opts->host_iface = p;
> +			break;
>  		case NVMF_OPT_HOST_ID:
>  			p = match_strdup(args);
>  			if (!p) {
> @@ -938,6 +951,7 @@ void nvmf_free_options(struct nvmf_ctrl_options *opts)
>  	kfree(opts->trsvcid);
>  	kfree(opts->subsysnqn);
>  	kfree(opts->host_traddr);
> +	kfree(opts->host_iface);
>  	kfree(opts);
>  }
>  EXPORT_SYMBOL_GPL(nvmf_free_options);
> diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
> index d7f7974dc208..c31dad69a773 100644
> --- a/drivers/nvme/host/fabrics.h
> +++ b/drivers/nvme/host/fabrics.h
> @@ -66,6 +66,7 @@ enum {
>  	NVMF_OPT_NR_POLL_QUEUES = 1 << 18,
>  	NVMF_OPT_TOS		= 1 << 19,
>  	NVMF_OPT_FAIL_FAST_TMO	= 1 << 20,
> +	NVMF_OPT_HOST_IFACE	= 1 << 21,
>  };
> 
>  /**
> @@ -83,7 +84,9 @@ enum {
>   * @trsvcid:	The transport-specific TRSVCID field for a port on the
>   *              subsystem which is adding a controller.
>   * @host_traddr: A transport-specific field identifying the NVME host port
> - *              to use for the connection to the controller.
> + *     to use for the connection to the controller.
> + * @host_iface: A transport-specific field identifying the NVME host
> + *     interface to use for the connection to the controller.
>   * @queue_size: Number of IO queue elements.
>   * @nr_io_queues: Number of controller IO queues that will be established.
>   * @reconnect_delay: Time between two consecutive reconnect attempts.

Please stick with the existing indentation.

> @@ -108,6 +111,7 @@ struct nvmf_ctrl_options {
>  	char			*traddr;
>  	char			*trsvcid;
>  	char			*host_traddr;
> +	char			*host_iface;
>  	size_t			queue_size;
>  	unsigned int		nr_io_queues;
>  	unsigned int		reconnect_delay;
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 0222e23f5936..3e91bd13d385 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -123,6 +123,7 @@ struct nvme_tcp_ctrl {
>  	struct blk_mq_tag_set	admin_tag_set;
>  	struct sockaddr_storage addr;
>  	struct sockaddr_storage src_addr;
> +	struct net_device	*ndev;
>  	struct nvme_ctrl	ctrl;
> 
>  	struct work_struct	err_work;
> @@ -1454,6 +1455,20 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
>  		}
>  	}
> 
> +	if (nctrl->opts->mask & NVMF_OPT_HOST_IFACE) {
> +		char *iface = nctrl->opts->host_iface;
> +		sockptr_t optval = KERNEL_SOCKPTR(iface);
> +
> +		ret = sock_setsockopt(queue->sock, SOL_SOCKET, SO_BINDTODEVICE,
> +				      optval, strlen(iface));
> +		if (ret) {
> +			dev_err(nctrl->device,
> +			  "failed to bind to interface %s queue %d err %d\n",
> +			  iface, qid, ret);
> +			goto err_sock;
> +		}
> +	}
> +
>  	queue->hdr_digest = nctrl->opts->hdr_digest;
>  	queue->data_digest = nctrl->opts->data_digest;
>  	if (queue->hdr_digest || queue->data_digest) {
> @@ -2467,6 +2482,7 @@ nvme_tcp_existing_controller(struct nvmf_ctrl_options *opts)
>  static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
>  		struct nvmf_ctrl_options *opts)
>  {
> +	char addr[256];
>  	struct nvme_tcp_ctrl *ctrl;
>  	int ret;
> 

Please don't allocate large elements on the stack; kernel stack size is
quite limited.

> @@ -2514,6 +2530,37 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
>  		}
>  	}
> 
> +	if (opts->mask & NVMF_OPT_HOST_IFACE) {
> +		ctrl->ndev = dev_get_by_name(&init_net, opts->host_iface);
> +		if (!ctrl->ndev) {
> +			pr_err("invalid interface passed: %s\n",
> +			       opts->host_iface);
> +			ret = -ENODEV;
> +			goto out_free_ctrl;
> +		}
> +	}
> +
> +	/* For IPv6 HOST_TRADDR, check if scope ID is specified. If not, set
> +	 * HOST_TRADDR's scope ID to the scope ID of HOST_IFACE or the scope
> +	 * ID of TRADDR
> +	 */
> +	if ((opts->mask & NVMF_OPT_HOST_TRADDR) &&
> +	    (ctrl->src_addr.ss_family == AF_INET6)) {
> +		struct sockaddr_in6 *src_addr6 =
> +			(struct sockaddr_in6 *)&ctrl->src_addr;
> +
> +		if (src_addr6->sin6_scope_id == 0) {
> +			if (ctrl->ndev && (ctrl->ndev->ifindex != 0)) {
> +				src_addr6->sin6_scope_id = ctrl->ndev->ifindex;
> +			} else {
> +				struct sockaddr_in6 *addr6 =
> +					(struct sockaddr_in6 *)&ctrl->addr;
> +
> +				src_addr6->sin6_scope_id = addr6->sin6_scope_id;
> +			}
> +		}
> +	}
> +
>  	if (!opts->duplicate_connect && nvme_tcp_existing_controller(opts)) {
>  		ret = -EALREADY;
>  		goto out_free_ctrl;
> @@ -2540,8 +2587,9 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
>  	if (ret)
>  		goto out_uninit_ctrl;
> 
> -	dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISp\n",
> -		ctrl->ctrl.opts->subsysnqn, &ctrl->addr);
> +	nvmf_get_address(&ctrl->ctrl, addr, sizeof(addr)-1);
> +	dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %s\n",
> +		 ctrl->ctrl.opts->subsysnqn, addr);
> 

I don't think that is what we want. The 'addr' string is the entire
'connect' address, and can get quite longish.
So please just print the IPv4/IPv6 destination address here; any other
information should be glanced from the 'address' sysfs attribute.

>  	mutex_lock(&nvme_tcp_ctrl_mutex);
>  	list_add_tail(&ctrl->list, &nvme_tcp_ctrl_list);
> @@ -2570,7 +2618,7 @@ static struct nvmf_transport_ops nvme_tcp_transport = {
>  			  NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
>  			  NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST |
>  			  NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES |
> -			  NVMF_OPT_TOS,
> +			  NVMF_OPT_TOS | NVMF_OPT_HOST_IFACE,
>  	.create_ctrl	= nvme_tcp_create_ctrl,
>  };
> 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare at suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)



More information about the Linux-nvme mailing list