[PATCH 1/1] Add 'Transport Interface' (triface) option. This can be used to specify the IP interface to use for the connection. The driver uses that to set SO_BINDTODEVICE on the socket before connecting.
Sagi Grimberg
sagi at grimberg.me
Mon May 10 19:13:36 BST 2021
>>> ping <dest-ip-addr>%<interface>
>>
>> Ping only supports this syntax for IPv6 no?
>>
>>> Extending this approach to nvme-cli we arrive to something like this:
>>>
>>> nvme discover --traddr 100.64.29.2%enp0s8 --host-traddr 192.168.56.102
>> ....
>>
>> We already support this for IPv6, we can do that also for IPv4, but this syntax
>> may not be trivially expected for ipv4?
>
> I tried this for IPv6 and it doesn't work. Here's what I get:
> $ sudo nvme discover -g -G -t tcp -s 8009 -a fe80::800:27ff:fe00:0
> Failed to write to /dev/nvme-fabrics: Invalid argument
> $ sudo nvme discover -g -G -t tcp -s 8009 -a fe80::800:27ff:fe00:0%enp0s8
> Failed to write to /dev/nvme-fabrics: Invalid argument
> $ sudo nvme discover -g -G -t tcp -s 8009 -a [fe80::800:27ff:fe00:0]
> failed to resolve host [fe80::800:27ff:fe00:0] info
> $ sudo nvme discover -g -G -t tcp -s 8009 -a [fe80::800:27ff:fe00:0%enp0s8]
> failed to resolve host [fe80::800:27ff:fe00:0%enp0s8] info
# nvme discover -t tcp -a fe80::5054:ff:fef1:9f3b -w fe80::5054:ff:fe28:5edb%enp6s0
Discovery Log Number of Records 1, Generation counter 5
=====Discovery Log Entry 0======
trtype: tcp
adrfam: ipv6
subtype: nvme subsystem
treq: not specified, sq flow control disable supported
portid: 3
trsvcid: 8009
subnqn: testnqn1
traddr: fe80::5054:ff:fef1:9f3b%enp6s0
sectype: none
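
For illustration only, here is a minimal Python sketch of how an "address%interface" string could be split into its two parts for either address family. The helper name split_zone is hypothetical; this is not what nvme-cli or the kernel does today, just one way to picture the proposed syntax.

```python
import ipaddress


def split_zone(traddr):
    """Split 'ADDR%IFACE' into (ADDR, IFACE); IFACE is None when absent.

    Hypothetical helper: accepts the '%' suffix for IPv4 as well as IPv6,
    which goes beyond what RFC 4007 defines (IPv6 only).
    """
    addr, sep, zone = traddr.partition("%")
    # Raises ValueError if the part before '%' is not an IP literal.
    ipaddress.ip_address(addr)
    if sep and not zone:
        raise ValueError("empty interface name after '%'")
    return addr, (zone or None)
```

With this, "fe80::5054:ff:fef1:9f3b%enp6s0" and "100.64.29.2%enp0s8" would both yield an address plus an interface name, while a bare address yields no interface.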
>
>>
>>> This tells nvme to connect to 100.64.29.2 on interface enp0s8. We make no
>>> change to the --host-traddr option. It continues to be used to specify the
>>> Source IP address only (for the rare cases where users want to specify a
>>> Source Address other than the default). With this, the interface is specified
>>> by name and not by its associated address. This is not only more intuitive,
>>> but, as I stated before, eliminates the problem caused by mapping the same
>>> IP address to multiple interfaces (not to mention that doing a reverse lookup
>>> on an IP address to find the interface is extra work that we don’t need to do
>>> in kernel space).
>>
>> Maybe we do something like ping -I for host_traddr, from ping man pages:
>>
>> -I interface
>>     interface is either an address, an interface name or a VRF name. If
>>     interface is an address, it sets source address to specified interface
>>     address. If interface is an interface name, it sets source interface to
>>     specified interface. If interface is a VRF name, each packet is routed
>>     using the corresponding routing table; in this case, the -I option can
>>     be repeated to specify a source address. NOTE: For IPv6, when doing
>>     ping to a link-local scope address, link specification (by the
>>     '%'-notation in destination, or by this option) can be used but it is
>>     no longer required.
>>
>>
>> Without the repetition though, unless we need to support two interfaces
>> that share the same multiple addresses in the same subnet, which sounds
>> completely crazy to me...
>
> Hi Sagi,
>
> If we want to follow ping as an example, the repetition is needed not to specify two interfaces, but to specify an interface and the source address. In a previous example (reproduced below), I described a configuration where an interface had several addresses assigned to it. By default, Linux always picks the same Source address (i.e. 192.168.56.101 in this example) when connecting. If a user wants a different source address they need a way to specify it (currently with --host-traddr). Users also need a way to specify an interface separately from the source address (either with a new option like --host-iface or by repeating --host-traddr). With the example below, if we wanted to force ping to use interface enp0s8 and source address 192.168.56.103, we would repeat the -I option, for example "ping -I enp0s8 -I 192.168.56.103". We need a way to do the same with nvme-cli.
>
> I thought that introducing a new option, "--host-iface", had the smallest impact since it requires less code changes, but that was turned down (not sure exactly why). I then suggested that we use the '%' delimiter for IPv4 and IPv6. I agree that it is not 100% the same as ping since ping only allows the '%' delimiter for IPv6 addresses (as per RFC4007). As you suggested, we could repeat the --host-traddr option (e.g. --host-traddr enp0s8 --host-traddr 192.168.56.103), but this is more impactful to the code than adding a separate --host-iface option.
It's less about code changes and more about adding a new user ABI; that is
the reason why (at least I'm not fully on board just yet).
> EXAMPLE: Interface with several addresses assigned:
> $ ip addr list dev enp0s8
> 3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
> link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
> inet 192.168.56.101/24 brd 192.168.56.255 scope ...
> valid_lft 426sec preferred_lft 426sec
> inet 192.168.56.102/24 scope global secondary enp0s8
> valid_lft forever preferred_lft forever
> inet 192.168.56.103/24 scope global secondary enp0s8
> valid_lft forever preferred_lft forever
> inet 192.168.56.104/24 scope global secondary enp0s8
> valid_lft forever preferred_lft forever
>
> In the end, it doesn't really matter (to me) how it is implemented. However, a solution that has little to no impact on existing code would be nice. Just like ping, we need a way to specify an interface by its **interface name** (and not by its associated IP address), and we need to allow users to select which Source IP address to use when there are multiple addresses associated with an interface.
The '%' may be confusing when it comes to other transports as well (e.g.
rdma/fc would have to either reject or ignore it, but regardless of how
we add it that would be the case). Having host-traddr accept either ip
or interface seems the most desirable, however that won't work if there
are 2 interfaces that share multiple ip addresses. So if this is a
requirement we'll probably need to add --host-iface as another option...
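
To make the "host-traddr accepts either ip or interface" idea concrete, here is a rough userspace sketch in Python, modeled on the repeated "ping -I" behavior. The function names classify_host_traddr and apply_host_options are made up for illustration, and the real decision would of course live in nvme-cli and the kernel driver, not in Python. It assumes Linux, where SO_BINDTODEVICE is available.

```python
import ipaddress
import socket

# Linux value of SO_BINDTODEVICE, used as a fallback if the socket module
# on this platform does not expose the constant.
SO_BINDTODEVICE = getattr(socket, "SO_BINDTODEVICE", 25)


def classify_host_traddr(value):
    """Return ("address", value) for an IP literal, else ("iface", value)."""
    try:
        ipaddress.ip_address(value)
    except ValueError:
        # Not an IP literal: treat it as an interface name (assumption).
        return ("iface", value)
    return ("address", value)


def apply_host_options(sock, values):
    """Apply each value to sock before connect(), in the order given.

    An interface name becomes SO_BINDTODEVICE; an address becomes bind()
    with port 0 so the kernel still picks the source port.
    """
    for value in values:
        kind, val = classify_host_traddr(value)
        if kind == "iface":
            sock.setsockopt(socket.SOL_SOCKET, SO_BINDTODEVICE,
                            val.encode() + b"\0")
        else:
            sock.bind((val, 0))
```

Calling apply_host_options(sock, ["enp0s8", "192.168.56.103"]) would then correspond to "ping -I enp0s8 -I 192.168.56.103". Note that an interface whose name happens to parse as an IP address would be misclassified, which is one argument for a separate --host-iface option.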