[PATCH net-next v2 2/6] tunnels: PMTU discovery support for directly bridged IP packets

Wed Aug 5 12:54:03 EDT 2020

On Tue, 4 Aug 2020 at 11:24, Stefano Brivio <sbrivio at redhat.com> wrote:
>
> It's currently possible to bridge Ethernet tunnels carrying IP
> packets directly to external interfaces without assigning them
> addresses and routes on the bridged network itself: this is the case
> for UDP tunnels bridged with a standard bridge or by Open vSwitch.
>
> PMTU discovery is currently broken with those configurations, because
> the encapsulation effectively decreases the MTU of the link, and
> while we are able to account for this using PMTU discovery on the
> lower layer, we don't have a way to relay ICMP or ICMPv6 messages
> needed by the sender, because we don't have valid routes to it.
>
> On the other hand, as a tunnel endpoint, we can't fragment packets
> as a general approach: this is for instance clearly forbidden for
> VXLAN by RFC 7348, section 4.3:
>
>    VTEPs MUST NOT fragment VXLAN packets.  Intermediate routers may
>    fragment encapsulated VXLAN packets due to the larger frame size.
>    The destination VTEP MAY silently discard such VXLAN fragments.
>
> The same paragraph recommends that the MTU over the physical network
> accomodates for encapsulations, but this isn't a practical option for
> complex topologies, especially for typical Open vSwitch use cases.
>
> Further, it states that:
>
>    Other techniques like Path MTU discovery (see [RFC1191] and
>    [RFC1981]) MAY be used to address this requirement as well.
>
> Now, PMTU discovery already works for routed interfaces, we get
> route exceptions created by the encapsulation device as they receive
> ICMP Fragmentation Needed and ICMPv6 Packet Too Big messages, and
> we already rebuild those messages with the appropriate MTU and route
> them back to the sender.
>
> Add the missing bits for bridged cases:
>
> - checks in skb_tunnel_check_pmtu() to understand if it's appropriate
>   to trigger a reply according to RFC 1122 section 3.2.2 for ICMP and
>   RFC 4443 section 2.4 for ICMPv6. This function is already called by
>   UDP tunnels
>
> - a new function generating those ICMP or ICMPv6 replies. We can't
>   reuse icmp_send() and icmp6_send() as we don't see the sender as a
>   valid destination. This doesn't need to be generic, as we don't
>   cover any other type of ICMP errors given that we only provide an
>   encapsulation function to the sender
>
> While at it, make the MTU check in skb_tunnel_check_pmtu() accurate:
> we might receive GSO buffers here, and the passed headroom already
> includes the inner MAC length, so we don't have to account for it
> a second time (that would imply three MAC headers on the wire, but
> there are just two).
>
> This issue became visible while bridging IPv6 packets with 4500 bytes
> of payload over GENEVE using IPv4 with a PMTU of 4000. Given the 50
> bytes of encapsulation headroom, we would advertise MTU as 3950, and
> we would reject fragmented IPv6 datagrams of 3958 bytes size on the
> wire. We're exclusively dealing with network MTU here, though, so we
> could get Ethernet frames up to 3964 octets in that case.
>
> v2:
> - moved skb_tunnel_check_pmtu() to ip_tunnel_core.c (David Ahern)
> - split IPv4/IPv6 functions (David Ahern)
>
> Signed-off-by: Stefano Brivio <sbrivio at redhat.com>
> ---
>  drivers/net/bareudp.c     |   5 +-
>  drivers/net/geneve.c      |   5 +-
>  drivers/net/vxlan.c       |   4 +-
>  include/net/dst.h         |  10 --
>  include/net/ip_tunnels.h  |   2 +
>  net/ipv4/ip_tunnel_core.c | 244 ++++++++++++++++++++++++++++++++++++++
>  6 files changed, 254 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/net/bareudp.c b/drivers/net/bareudp.c
> index 3b6664c7e73c..841910f1db65 100644
> --- a/drivers/net/bareudp.c
> +++ b/drivers/net/bareudp.c
> @@ -308,7 +308,7 @@ static int bareudp_xmit_skb(struct sk_buff *skb, struct net_device *dev,
>                 return PTR_ERR(rt);
>
>         skb_tunnel_check_pmtu(skb, &rt->dst,
> -                             BAREUDP_IPV4_HLEN + info->options_len);
> +                             BAREUDP_IPV4_HLEN + info->options_len, false);
>
>         sport = udp_flow_src_port(bareudp->net, skb,
>                                   bareudp->sport_min, USHRT_MAX,
> @@ -369,7 +369,8 @@ static int bareudp6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
>         if (IS_ERR(dst))
>                 return PTR_ERR(dst);
>
> -       skb_tunnel_check_pmtu(skb, dst, BAREUDP_IPV6_HLEN + info->options_len);
> +       skb_tunnel_check_pmtu(skb, dst, BAREUDP_IPV6_HLEN + info->options_len,
> +                             false);
>
>         sport = udp_flow_src_port(bareudp->net, skb,
>                                   bareudp->sport_min, USHRT_MAX,
> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
> index 017c13acc911..de86b6d82132 100644
> --- a/drivers/net/geneve.c
> +++ b/drivers/net/geneve.c
> @@ -894,7 +894,7 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
>                 return PTR_ERR(rt);
>
>         skb_tunnel_check_pmtu(skb, &rt->dst,
> -                             GENEVE_IPV4_HLEN + info->options_len);
> +                             GENEVE_IPV4_HLEN + info->options_len, false);
>
>         sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
>         if (geneve->cfg.collect_md) {
> @@ -955,7 +955,8 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
>         if (IS_ERR(dst))
>                 return PTR_ERR(dst);
>
> -       skb_tunnel_check_pmtu(skb, dst, GENEVE_IPV6_HLEN + info->options_len);
> +       skb_tunnel_check_pmtu(skb, dst, GENEVE_IPV6_HLEN + info->options_len,
> +                             false);
>
>         sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
>         if (geneve->cfg.collect_md) {
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 77658425db8a..1432544da507 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -2720,7 +2720,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>                 }
>
>                 ndst = &rt->dst;
> -               skb_tunnel_check_pmtu(skb, ndst, VXLAN_HEADROOM);
> +               skb_tunnel_check_pmtu(skb, ndst, VXLAN_HEADROOM, false);
>
>                 tos = ip_tunnel_ecn_encap(RT_TOS(tos), old_iph, skb);
>                 ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
> @@ -2760,7 +2760,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>                                 goto out_unlock;
>                 }
>
> -               skb_tunnel_check_pmtu(skb, ndst, VXLAN6_HEADROOM);
> +               skb_tunnel_check_pmtu(skb, ndst, VXLAN6_HEADROOM, false);
>
>                 tos = ip_tunnel_ecn_encap(RT_TOS(tos), old_iph, skb);
>                 ttl = ttl ? : ip6_dst_hoplimit(ndst);
> diff --git a/include/net/dst.h b/include/net/dst.h
> index 852d8fb36ab7..6ae2e625050d 100644
> --- a/include/net/dst.h
> +++ b/include/net/dst.h
> @@ -535,14 +535,4 @@ static inline void skb_dst_update_pmtu_no_confirm(struct sk_buff *skb, u32 mtu)
>                 dst->ops->update_pmtu(dst, NULL, skb, mtu, false);
>  }
>
> -static inline void skb_tunnel_check_pmtu(struct sk_buff *skb,
> -                                        struct dst_entry *encap_dst,
> -                                        int headroom)
> -{
> -       u32 encap_mtu = dst_mtu(encap_dst);
> -
> -       if (skb->len > encap_mtu - headroom)
> -               skb_dst_update_pmtu_no_confirm(skb, encap_mtu - headroom);
> -}
> -
>  #endif /* _NET_DST_H */
> diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
> index 36025dea7612..02ccd32542d0 100644
> --- a/include/net/ip_tunnels.h
> +++ b/include/net/ip_tunnels.h
> @@ -420,6 +420,8 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct sk_buff *skb,
>                    u8 tos, u8 ttl, __be16 df, bool xnet);
>  struct metadata_dst *iptunnel_metadata_reply(struct metadata_dst *md,
>                                              gfp_t flags);
> +int skb_tunnel_check_pmtu(struct sk_buff *skb, struct dst_entry *encap_dst,
> +                         int headroom, bool reply);
>
>  int iptunnel_handle_offloads(struct sk_buff *skb, int gso_type_mask);
>
> diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
> index f8b419e2475c..9ddee2a0c66d 100644
> --- a/net/ipv4/ip_tunnel_core.c
> +++ b/net/ipv4/ip_tunnel_core.c
> @@ -184,6 +184,250 @@ int iptunnel_handle_offloads(struct sk_buff *skb,
>  }
>  EXPORT_SYMBOL_GPL(iptunnel_handle_offloads);
>
> +/**
> + * iptunnel_pmtud_build_icmp() - Build ICMP error message for PMTUD
> + * @skb:       Original packet with L2 header
> + * @mtu:       MTU value for ICMP error
> + *
> + * Return: length on success, negative error code if message couldn't be built.
> + */
> +static int iptunnel_pmtud_build_icmp(struct sk_buff *skb, int mtu)
> +{
> +       const struct iphdr *iph = ip_hdr(skb);
> +       struct icmphdr *icmph;
> +       struct iphdr *niph;
> +       struct ethhdr eh;
> +       int len, err;
> +
> +       if (!pskb_may_pull(skb, ETH_HLEN + sizeof(struct iphdr)))
> +               return -EINVAL;
> +
> +       skb_copy_bits(skb, skb_mac_offset(skb), &eh, ETH_HLEN);
> +       pskb_pull(skb, ETH_HLEN);
> +       skb_reset_network_header(skb);
> +
> +       err = pskb_trim(skb, 576 - sizeof(*niph) - sizeof(*icmph));
> +       if (err)
> +               return err;
> +
> +       len = skb->len + sizeof(*icmph);
> +       err = skb_cow(skb, sizeof(*niph) + sizeof(*icmph) + ETH_HLEN);
> +       if (err)
> +               return err;
> +
> +       icmph = skb_push(skb, sizeof(*icmph));
> +       *icmph = (struct icmphdr) {
> +               .type                   = ICMP_DEST_UNREACH,
> +               .code                   = ICMP_FRAG_NEEDED,
> +               .checksum               = 0,
> +               .un.frag.__unused       = 0,
> +               .un.frag.mtu            = ntohs(mtu),
> +       };
> +       icmph->checksum = ip_compute_csum(icmph, len);
> +       skb_reset_transport_header(skb);
> +
> +       niph = skb_push(skb, sizeof(*niph));
> +       *niph = (struct iphdr) {
> +               .ihl                    = sizeof(*niph) / 4u,
> +               .version                = 4,
> +               .tos                    = 0,
> +               .tot_len                = htons(len + sizeof(*niph)),
> +               .id                     = 0,
> +               .frag_off               = htons(IP_DF),
> +               .ttl                    = iph->ttl,
> +               .protocol               = IPPROTO_ICMP,
> +               .saddr                  = iph->daddr,
> +               .daddr                  = iph->saddr,
> +       };
> +       ip_send_check(niph);
> +       skb_reset_network_header(skb);
> +
> +       skb->ip_summed = CHECKSUM_NONE;
> +
> +       eth_header(skb, skb->dev, htons(eh.h_proto), eh.h_source, eh.h_dest, 0);
> +       skb_reset_mac_header(skb);
> +
> +       return skb->len;
> +}
> +
> +/**
> + * iptunnel_pmtud_check_icmp() - Trigger ICMP reply if needed and allowed
> + * @skb:       Buffer being sent by encapsulation, L2 headers expected
> + * @mtu:       Network MTU for path
> + *
> + * Return: 0 for no ICMP reply, length if built, negative value on error.
> + */
> +static int iptunnel_pmtud_check_icmp(struct sk_buff *skb, int mtu)
> +{
> +       const struct icmphdr *icmph = icmp_hdr(skb);
> +       const struct iphdr *iph = ip_hdr(skb);
> +
> +       if (mtu <= 576 || iph->frag_off != htons(IP_DF))
> +               return 0;
> +
> +       if (ipv4_is_lbcast(iph->daddr)  || ipv4_is_multicast(iph->daddr) ||
> +           ipv4_is_zeronet(iph->saddr) || ipv4_is_loopback(iph->saddr)  ||
> +           ipv4_is_lbcast(iph->saddr)  || ipv4_is_multicast(iph->saddr))
> +               return 0;
> +
> +       if (iph->protocol == IPPROTO_ICMP && icmp_is_err(icmph->type))
> +               return 0;
> +
> +       return iptunnel_pmtud_build_icmp(skb, mtu);
> +}
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +/**
> + * iptunnel_pmtud_build_icmpv6() - Build ICMPv6 error message for PMTUD
> + * @skb:       Original packet with L2 header
> + * @mtu:       MTU value for ICMPv6 error
> + *
> + * Return: length on success, negative error code if message couldn't be built.
> + */
> +static int iptunnel_pmtud_build_icmpv6(struct sk_buff *skb, int mtu)
> +{
> +       const struct ipv6hdr *ip6h = ipv6_hdr(skb);
> +       struct icmp6hdr *icmp6h;
> +       struct ipv6hdr *nip6h;
> +       struct ethhdr eh;
> +       int len, err;
> +       __wsum csum;
> +
> +       if (!pskb_may_pull(skb, ETH_HLEN + sizeof(struct ipv6hdr)))
> +               return -EINVAL;
> +
> +       skb_copy_bits(skb, skb_mac_offset(skb), &eh, ETH_HLEN);
> +       pskb_pull(skb, ETH_HLEN);
> +       skb_reset_network_header(skb);
> +
> +       err = pskb_trim(skb, IPV6_MIN_MTU - sizeof(*nip6h) - sizeof(*icmp6h));
> +       if (err)
> +               return err;
> +
> +       len = skb->len + sizeof(*icmp6h);
> +       err = skb_cow(skb, sizeof(*nip6h) + sizeof(*icmp6h) + ETH_HLEN);
> +       if (err)
> +               return err;
> +
> +       icmp6h = skb_push(skb, sizeof(*icmp6h));
> +       *icmp6h = (struct icmp6hdr) {
> +               .icmp6_type             = ICMPV6_PKT_TOOBIG,
> +               .icmp6_code             = 0,
> +               .icmp6_cksum            = 0,
> +               .icmp6_mtu              = htonl(mtu),
> +       };
> +       skb_reset_transport_header(skb);
> +
> +       nip6h = skb_push(skb, sizeof(*nip6h));
> +       *nip6h = (struct ipv6hdr) {
> +               .priority               = 0,
> +               .version                = 6,
> +               .flow_lbl               = { 0 },
> +               .payload_len            = htons(len),
> +               .nexthdr                = IPPROTO_ICMPV6,
> +               .hop_limit              = ip6h->hop_limit,
> +               .saddr                  = ip6h->daddr,
> +               .daddr                  = ip6h->saddr,
> +       };
> +       skb_reset_network_header(skb);
> +
> +       csum = csum_partial(icmp6h, len, 0);
> +       icmp6h->icmp6_cksum = csum_ipv6_magic(&nip6h->saddr, &nip6h->daddr, len,
> +                                             IPPROTO_ICMPV6, csum);

Linux next build breaks for riscv architecture defconfig build.

    git_repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
    target_arch: riscv
    toolchain: gcc-9
    git_short_log: d15fe4ec0435 (\Add linux-next specific files for 20200805\)
    git_describe: next-20200805

Reported-by: Naresh Kamboju <naresh.kamboju at linaro.org>

make -sk KBUILD_BUILD_USER=TuxBuild -C/linux ARCH=riscv
CROSS_COMPILE=riscv64-linux-gnu- HOSTCC=gcc CC="sccache
riscv64-linux-gnu-gcc" O=build defconfig
#
#
# make -sk KBUILD_BUILD_USER=TuxBuild -C/linux -j16 ARCH=riscv
CROSS_COMPILE=riscv64-linux-gnu- HOSTCC=gcc CC="sccache
riscv64-linux-gnu-gcc" O=build
#
../net/ipv4/ip_tunnel_core.c: In function ‘iptunnel_pmtud_build_icmpv6’:
../net/ipv4/ip_tunnel_core.c:335:24: error: implicit declaration of
function ‘csum_ipv6_magic’; did you mean ‘csum_tcpudp_magic’?
[-Werror=implicit-function-declaration]
  335 |  icmp6h->icmp6_cksum = csum_ipv6_magic(&nip6h->saddr,
&nip6h->daddr, len,
      |                        ^~~~~~~~~~~~~~~
      |                        csum_tcpudp_magic
cc1: some warnings being treated as errors
make[3]: *** [../scripts/Makefile.build:283: net/ipv4/ip_tunnel_core.o] Error 1

Full build log link,
https://gitlab.com/Linaro/lkft/kernel-runs/-/jobs/671651045

-- 
Linaro LKFT
https://lkft.linaro.org