[PATCH 1/1] Add 'Transport Interface' (triface) option. This can be used to specify the IP interface to use for the connection. The driver uses that to set SO_BINDTODEVICE on the socket before connecting.

Belanger, Martin Martin.Belanger at dell.com
Mon May 10 14:49:31 BST 2021


> On 5/6/21 8:46 AM, Belanger, Martin wrote:
> >> On 5/6/21 8:05 AM, Hannes Reinecke wrote:
> >>> On 5/5/21 4:31 PM, Belanger, Martin wrote:
> >> [ .. ]
> >>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state
> >> UNKNOWN
> >>>> group default qlen 1000
> >>>>       link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> >>>>       inet 100.0.0.100/24 scope global lo
> >>>>          valid_lft forever preferred_lft forever
> >>>> 2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> >> fq_codel
> >>>> state UP group default qlen 1000
> >>>>       link/ether 08:00:27:21:65:ec brd ff:ff:ff:ff:ff:ff
> >>>>       inet 100.0.0.100/24 scope global enp0s3
> >>>>          valid_lft forever preferred_lft forever
> >>>> 3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> >> fq_codel
> >>>> state UP group default qlen 1000
> >>>>       link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
> >>>>       inet 100.0.0.100/24 scope global enp0s8
> >>>>          valid_lft forever preferred_lft forever
> >>>>
> >>>> The above is a VM that I configured with the same IP address
> >>>> (100.0.0.100) on all interfaces. Doing a reverse lookup to identify
> >>>> the unique interface associated with 100.0.0.100 would simply not
> >>>> work here. And this is why the option host_iface is required. I
> >>>> understand that the above config does not represent a standard host
> >>>> system, but I'm using this to prove a point: "we can never know how
> >>>> a user will configure their system and the above configuration is
> >>>> perfectly fine by Linux".
> >>>>
> >>>
> >>> ... and messing up any switch MAC address caching when doing so. I
> >>> guess the network admin will come down hard on you if you try that
> >>> on a production system.
> >>> And I sincerely question whether this is a valid use-case; I'm
> >>> already getting grief from our network admins if I dare to put two
> >>> network interfaces from the same machine in the same network.
> >>>
> >>>> The current TCP implementation for host_traddr uses
> >>>> bind()-before-connect(). This is a common construct to set the
> >>>> source IP address on the socket before connecting. This has no
> >>>> effect on how Linux will select the interface for the connection.
> >>>> That's because Linux uses the Weak End System model as described in
> >>>> RFC1122 [2].
> >>>> Setting the source address on a connection is a common requirement
> >>>> that linux-nvme needs to support. In fact, specifying the Source IP
> >>>> address is a mandatory FedGov requirement (e.g. connection to a
> >>>> RADIUS/TACACS+ server). Consider the following configuration.
> >>>>
> >>>> $ ip addr list dev enp0s8
> >>>> 3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> >> fq_codel
> >>>> state UP group default qlen 1000
> >>>>       link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
> >>>>       inet 192.168.56.101/24 brd 192.168.56.255 scope global
> >>>> dynamic noprefixroute enp0s8
> >>>>          valid_lft 426sec preferred_lft 426sec
> >>>>       inet 192.168.56.102/24 scope global secondary enp0s8
> >>>>          valid_lft forever preferred_lft forever
> >>>>       inet 192.168.56.103/24 scope global secondary enp0s8
> >>>>          valid_lft forever preferred_lft forever
> >>>>       inet 192.168.56.104/24 scope global secondary enp0s8
> >>>>          valid_lft forever preferred_lft forever
> >>>>
> >>>> Here we can see that several addresses are associated with
> >>>> interface enp0s8. By default, Linux will select the default IP
> >>>> address, 192.168.56.101, as the source address when connecting over
> >>>> interface enp0s8. Some users, however, want the ability to specify
> >>>> a different address (e.g.,
> >>>> 192.168.56.103) to be used as the source address.
> >>>> The option host_traddr can be used as-is to perform this function
> >>>> (I tested it).
> >>>>
> >>>
> >>> No disagreement here.
> >>>
> >>>> In conclusion, I believe that for TCP we need 2 options. One that
> >>>> can be used to specify an interface. And one that can be used to
> >>>> set the source address. And users should be allowed to use one or
> >>>> the other, or both, or none.
> >>>> Of course, the documentation for host_traddr will need some
> >>>> clarification. It should state that when used for TCP connection,
> >>>> this option only sets the source address. And the documentation for
> >>>> host_iface should say that this option only applies to TCP
> >>>> connections.
> >>>>
> >>>
> >>> I'm with James Smart here. I do fail to see the need for 'host_iface'
> >>> _without_ 'host_traddr'; especially for IPv6 where several addresses
> >>> are standard just specifying 'host_iface' simply is not enough, and
> >>> one has to specify 'host_traddr' additionally.
> >>>
> >>> So 'host_iface' should be contingent on 'host_traddr', meaning we
> >>> can just expand the syntax of 'host_traddr'.
> >>> One easy possibility would be to add ',nobind' to the host_traddr
> >>> syntax which would indicate that we should _not_ bind to the
> >>> underlying interface; I do think that binding to the respective
> >>> interface should be the default.
> >>>
> >> A-ha. Just spoke to our network folks, and they clarified the usage
> >> of binding to an IP address vs binding to a network interface.
> >> Apparently, binding to a source IP address does just that, setting
> >> the source IP address of the outgoing packet. That packet will
> >> _still_ be subjected to the normal routing table, as the routing
> >> table is just influenced by the _destination_ IP address.
> >> So if we want to have it routed via a specific interface (and thereby
> >> influencing the routing table) we need to bind it to that interface.
> >>
> >> The only valid scenario our network folks could come up with where we
> >> do _not_ want to bind to an interface is for asymmetric flows, ie in
> >> cases where the outgoing flow is routed to one interface and the
> >> incoming flow is arriving on another interface. But even they
> >> admitted that it's not a common scenario, and probably will be killed
> >> by anti-spoofing software running on the core switches ...
> >>
> >> But if we want to support _that_ then clearly binding to a specific
> >> interface doesn't work.
> >>
> >> So I would vote for making binding to the network interface holding
> >> the IP address the default, and add an option ',nobind' to host_traddr to
> >> skip it.
> >>
> >> Cheers,
> >>
> >> Hannes
> >> --
> >> Dr. Hannes Reinecke		        Kernel Storage Architect
> >> hare at suse.de			               +49 911 74053 688
> >> SUSE Software Solutions Germany GmbH, 90409 Nürnberg
> >> GF: F. Imendörffer, HRB 36809 (AG Nürnberg)
> >
> > Hi Hannes,
> >
> > If the only concern here is the addition of yet another option
> > (--host-iface), then may I suggest a simpler approach. What I'm proposing
> > adheres to RFC4007 [1], which defines a way to specify an interface by
> > using the '%' delimiter between the Destination IP address and the
> > Interface. In fact, "ping" uses this approach [2]. With ping, one can
> > force the connection to go a specific interface like this:
> >
> > ping <dest-ip-addr>%<interface>
> 
> Ping only supports this syntax for IPv6 no?
> 
> > Extending this approach to nvme-cli we arrive to something like this:
> >
> > nvme discover --traddr 100.64.29.2%enp0s8 --host-traddr 192.168.56.102
> ....
> 
> We already support this for IPv6, we can do that also for IPv4, but this syntax
> may not be trivially expected for ipv4?

I tried this for IPv6 and it doesn't work. Here's what I get:
$ sudo nvme discover -g -G -t tcp -s 8009 -a fe80::800:27ff:fe00:0
Failed to write to /dev/nvme-fabrics: Invalid argument
$ sudo nvme discover -g -G -t tcp -s 8009 -a fe80::800:27ff:fe00:0%enp0s8
Failed to write to /dev/nvme-fabrics: Invalid argument
$ sudo nvme discover -g -G -t tcp -s 8009 -a [fe80::800:27ff:fe00:0]
failed to resolve host [fe80::800:27ff:fe00:0] info
$ sudo nvme discover -g -G -t tcp -s 8009 -a [fe80::800:27ff:fe00:0%enp0s8]
failed to resolve host [fe80::800:27ff:fe00:0%enp0s8] info

> 
> > This tells nvme to connect to 100.64.29.2 on interface enp0s8. We make no
> > change to the --host-traddr option. It continues to be used to specify the
> > Source IP address only (for the rare cases where users want to specify a
> > Source Address other than the default). With this, the interface is
> > specified by name and not by its associated address. This is not only more
> > intuitive, but, as I stated before, eliminates the problem caused by
> > mapping the same IP address to multiple interfaces (not to mention that
> > doing a reverse lookup on an IP address to find the interface is extra
> > work that we don’t need to do in kernel space).
> 
> Maybe we do something like ping -I for host_traddr, from ping man pages:
> 
> -I interface
>             interface is either an address, an interface name or a VRF name.
>             If interface is an address, it sets source address to specified
>             interface address. If interface is an interface name, it sets
>             source interface to specified interface. If interface is a VRF
>             name, each packet is routed using the corresponding routing
>             table; in this case, the -I option can be repeated to specify a
>             source address. NOTE: For IPv6, when doing ping to a link-local
>             scope address, link specification (by the '%'-notation in
>             destination, or by this option) can be used but it is no longer
>             required.
> 
> 
> Without the repetition though, unless we need to support two interfaces
> that share the same multiple addresses in the same subnet, which sounds
> completely crazy to me...

Hi Sagi,

If we want to follow ping as an example, the repetition is needed not to specify two interfaces, but to specify an interface and a source address. In a previous example (reproduced below), I described a configuration where an interface had several addresses assigned to it. By default, Linux always picks the same source address (192.168.56.101 in this example) when connecting. If a user wants a different source address, they need a way to specify it (currently with --host-traddr). Users also need a way to specify an interface separately from the source address (either with a new option like --host-iface or by repeating --host-traddr). With the example below, if we wanted to force ping to use interface enp0s8 and source address 192.168.56.103, we would repeat the -I option, for example "ping -I enp0s8 -I 192.168.56.103". We need a way to do the same with nvme-cli.
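
To illustrate what a repeated option would have to do, here is a rough Python sketch (not nvme-cli code). The helper name classify_host_args is made up for this example, and VRF names, which ping also accepts, are ignored. Each value is treated as a source address if it parses as an IP literal, and as an interface name otherwise:

```python
import ipaddress


def is_ip_literal(value):
    """Return True if value parses as an IPv4 or IPv6 address literal."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def classify_host_args(values):
    """Sort repeated option values into (interface name, source address).

    Hypothetical helper that mimics how "ping -I" interprets each value.
    Either element of the result is None when it was not supplied.
    """
    iface = None
    source = None
    for value in values:
        if is_ip_literal(value):
            if source is not None:
                raise ValueError("source address specified more than once")
            source = value
        else:
            # Anything that is not an IP literal is assumed to be an
            # interface name; no check is made that the interface exists.
            if iface is not None:
                raise ValueError("interface specified more than once")
            iface = value
    return iface, source


if __name__ == "__main__":
    # Equivalent of "--host-traddr enp0s8 --host-traddr 192.168.56.103".
    print(classify_host_args(["enp0s8", "192.168.56.103"]))
```

Note that this kind of guessing is exactly the extra parsing that a dedicated --host-iface option would avoid.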

I thought that introducing a new option, "--host-iface", had the smallest impact since it requires fewer code changes, but that was turned down (not sure exactly why). I then suggested that we use the '%' delimiter for IPv4 and IPv6. I agree that it is not 100% the same as ping since ping only allows the '%' delimiter for IPv6 addresses (as per RFC4007). As you suggested, we could repeat the --host-traddr option (e.g. --host-traddr enp0s8 --host-traddr 192.168.56.103), but this has a larger impact on the code than adding a separate --host-iface option.
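
For reference, the '%' approach amounts to something like the following Python sketch. It is only an illustration of the parsing step, under the assumption that the same syntax is accepted for IPv4 and IPv6; the function name split_traddr is hypothetical and this is not how nvme-cli or the kernel parses traddr today:

```python
import ipaddress


def split_traddr(traddr):
    """Split "ADDR%IFACE" into (ADDR, IFACE).

    IFACE is None when no '%' suffix is present. Raises ValueError if the
    address part is not an IP literal or the interface name is empty.
    """
    addr, sep, iface = traddr.partition("%")
    # Validate the address part only; the interface name is not checked
    # against the interfaces that actually exist on the host.
    ipaddress.ip_address(addr)
    if sep and not iface:
        raise ValueError("empty interface name after '%'")
    return addr, (iface or None)


if __name__ == "__main__":
    for arg in ("100.64.29.2%enp0s8", "fe80::800:27ff:fe00:0%enp0s8",
                "100.64.29.2"):
        print(arg, "->", split_traddr(arg))
```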

EXAMPLE: Interface with several addresses assigned:
$ ip addr list dev enp0s8
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
      link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
      inet 192.168.56.101/24 brd 192.168.56.255 scope ...
         valid_lft 426sec preferred_lft 426sec
      inet 192.168.56.102/24 scope global secondary enp0s8
         valid_lft forever preferred_lft forever
      inet 192.168.56.103/24 scope global secondary enp0s8
         valid_lft forever preferred_lft forever
      inet 192.168.56.104/24 scope global secondary enp0s8
         valid_lft forever preferred_lft forever

In the end, it doesn't really matter (to me) how it is implemented. However, a solution that has little to no impact on existing code would be nice. Just like ping, we need a way to specify an interface by its **interface name** (and not by its associated IP address), and we need to allow users to select which Source IP address to use when there are multiple addresses associated with an interface.
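
Whatever the option syntax ends up being, the two settings map to two independent socket operations. The user-space Python sketch below is only meant to show that separation; it is IPv4-only, Linux-only, and the names make_socket, host_iface and host_traddr are used here purely for illustration. The driver would do the equivalent on its own socket before connecting:

```python
import socket


def make_socket(host_iface=None, host_traddr=None):
    """Create a TCP socket optionally pinned to an interface and/or source IP.

    Both arguments are independent: either, both, or neither may be given.
    The returned socket is not yet connected.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if host_iface is not None:
            # SO_BINDTODEVICE selects the interface by name. This may
            # require elevated privileges on some kernels.
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
                            host_iface.encode() + b"\0")
        if host_traddr is not None:
            # bind()-before-connect(): sets the source address only.
            # Port 0 lets the kernel pick an ephemeral port.
            sock.bind((host_traddr, 0))
    except OSError:
        sock.close()
        raise
    return sock


# Example (not executed here; needs the interface and address to exist):
#   sock = make_socket(host_iface="enp0s8", host_traddr="192.168.56.103")
#   sock.connect(("100.64.29.2", 8009))
```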

Regards,
Martin

