Followup: OpenConnect unusably slow

Wed Jun 19 15:16:23 EDT 2013

On 19.06.2013 18:14, David Woodhouse wrote:
> On Wed, 2013-06-19 at 17:23 +0200, Thomas Richter wrote:
>> If this is of any help: This is a DSL line by the German Telekom, which
>> is, to my knowledge, PPPoE between my router and their system. But what
>> the MTU is, or whether MTU even applies to this technology, I do not
>> know.
>
> Your MTU is likely to be 1492. There's 8 bytes of overhead for the PPP
> part of PPPoE, which we subtract from the normal Ethernet MTU of 1500.

Which is strange, because the router here reports an MTU of 1500. I
checked the router configuration, and it is indeed a PPPoE connection.
No filtering is enabled, but I found the following interesting entries
in the log:
10.12.2010 09:26:11 PPPoE startet PPP
10.12.2010 09:26:11 PPPoE empfange PADS
10.12.2010 09:26:11 PPPoE sende PADR
10.12.2010 09:26:11 PPPoE empfange PADO
10.12.2010 09:26:10 PPPoE sende PADI
10.12.2010 09:26:10 DSL ist verfügbar(DSL- Synchronisierung besteht).(R007)
10.12.2010 09:26:00 DSL-Synchronisation beginnt(Training).(R008)
18.06.2013 23:20:58 **fragmentation flood** 129.69.90.139, 443->> 192.168.2.103, 59947 (von PPPoE - Eingang)
18.06.2013 23:20:56 **fragmentation flood** 129.69.90.139, 443->> 192.168.2.103, 59947 (von PPPoE - Eingang)
18.06.2013 23:20:55 **fragmentation flood** 129.69.90.139, 443->> 192.168.2.103, 59947 (von PPPoE - Eingang)
18.06.2013 23:20:54 **fragmentation flood** 129.69.90.139, 443->> 192.168.2.103, 59947 (von PPPoE - Eingang)
18.06.2013 23:15:43 **fragmentation flood** 129.69.90.139, 443->> 192.168.2.103, 48924 (von PPPoE - Eingang)
18.06.2013 23:15:39 **fragmentation flood** 129.69.90.139, 443->> 192.168.2.103, 48924 (von PPPoE - Eingang)
18.06.2013 23:15:37 **fragmentation flood** 129.69.90.139, 443->> 192.168.2.103, 48924 (von PPPoE - Eingang)
18.06.2013 23:15:36 **fragmentation flood** 129.69.90.139, 443->> 192.168.2.103, 48924 (von PPPoE - Eingang)
18.06.2013 23:15:29 **fragmentation flood** 129.69.90.139, 443->> 192.168.2.103, 48924 (von PPPoE - Eingang)
18.06.2013 23:15:27 **fragmentation flood** 129.69.90.139, 443->> 192.168.2.103, 48924 (von PPPoE - Eingang)
18.06.2013 23:15:26 **fragmentation flood** 129.69.90.139, 443->> 192.168.2.103, 48924 (von PPPoE - Eingang)
18.06.2013 23:15:25 **fragmentation flood** 129.69.90.139, 443->> 192.168.2.103, 48924 (von PPPoE - Eingang)
18.06.2013 23:15:17 **fragmentation flood** 129.69.90.139, 443->> 192.168.2.103, 48924 (von PPPoE - Eingang)
18.06.2013 23:15:14 **fragmentation flood** 129.69.90.139, 443->> 192.168.2.103, 48924 (von PPPoE - Eingang)

"von PPPoE - Eingang" is German for "from PPPoE - input". The IP
129.69.90.139 is one of several VPN endpoints of the AnyConnect cluster
in our computing center (oh joy!) and 443 is (of course) the HTTPS port
they are using. So I guess we're getting closer. 192.168.2.103 is (of
course) the local IP after NAT by the router.

So, could it be that either the router is not telling me the right MTU,
or that something icky happens further upstream in the AnyConnect
configuration?
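
For reference, a rough way to check which MTU the path really supports
is to send pings with the don't-fragment bit set. The ICMP payload that
just fits is the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP
header. A minimal sketch, assuming Linux iputils ping (its "-M do"
option); ping_payload_for_mtu is a made-up helper name and
some.host.example is a placeholder, since the VPN endpoint may well not
answer pings:

```shell
#!/bin/sh
# Hypothetical helper: largest ICMP echo payload that fits into one
# unfragmented IPv4 packet for a given MTU (20 bytes IPv4 header
# without options + 8 bytes ICMP header).
ping_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Plain Ethernet vs. PPPoE (1500 minus 8 bytes of PPPoE overhead).
ping_payload_for_mtu 1500
ping_payload_for_mtu 1492

# Example probe (not run here): -M do forbids fragmentation, so a
# payload that is too large fails instead of being split.
#   ping -c 3 -M do -s "$(ping_payload_for_mtu 1492)" some.host.example
```

If the larger payload fails while the smaller one gets through, that
would point to an MTU below 1500 somewhere on the path.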

> I note that when you are missing the incoming DTLS packets, you are
> *actually* receiving one fragment out of the pair. This is often a
> symptom of a poor network connection or driver. What's "special" about
> fragmented packets (if they're fragmented locally) is that they are sent
> in *quick* succession, really closely together. And often the second one
> is lost.
>
> However, in your case in the problematic period I'm looking at
> (07:36:52) you are receiving the *second* fragment out of each packet
> (offset 744 onwards), not the first. Which is really weird. That's
> almost as if some firewall kicked in and started rejecting the start of
> those packets (which contains the UDP port number etc.) but still had to
> allow the fragments because it didn't know whether it could drop them.
> Or something weird like that.

Uh, that's weird indeed.

> Can you get a packet capture from the public-facing side of your home
> network, on the outbound interface of your DSL router? Or as far
> "upstream" as you can, preferably on the outside of your NAT?

Probably not behind the router because I do not have the tools to 
inspect the line there, but I can try to reach out to the network guys 
in our computing center, and try to get a network sniff there to see 
whether there is a problem on this end. What is interesting indeed is that

> It would be interesting to get a packet capture from a router at the VPN
> server end if you can, too. We'd check that they really are *sending*
> the offending packets, and whether they're receiving any ICMP responses.

I'll try to get the folks on the line, hopefully.

>>> It's entirely possible that some complete retard would respond to this
>>> breakage by deciding to fragment *all* outbound packets to a maximum of
>>> 750 bytes, instead of fixing the broken firewall.
>>>
>>> It isn't necessarily related to your local network at all. It *might*
>>> have been doing that before, on your working system. It probably was.
>>
>> Anyhow, I kept playing a bit, and set the MTU to 700 manually in a test
>> script. Result is that network connectivity becomes normal. So indeed,
>> packet segmentation *is* the source of the evil.
>
> How precise is that? Does 700 work and 701 fail with missing packets?
> I'd have thought it would be higher.

Not tried yet, but 700 at least *is* sufficient to fix the problem.
I'll probably start from 1500 minus the overhead of PPPoE, minus the
overhead of the VPN, and bisect it from both ends. This will take a
while.
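
The bisection could be sketched roughly like this. It assumes that if a
given MTU works, every smaller one does too, and that 700 is a known-good
lower bound. mtu_works, bisect_mtu and ASSUMED_LIMIT are hypothetical
names; the probe is faked with an arbitrary threshold so the search
itself can be tried offline, whereas a real probe would have to bring up
the tunnel with --mtu=<value> and check whether traffic flows. The upper
bound of 1492 is just the PPPoE figure; the VPN overhead would still
have to be subtracted.

```shell
#!/bin/sh
# Fake probe: pretend every MTU up to ASSUMED_LIMIT works. The value
# 1200 is an arbitrary assumption, not a measurement.
ASSUMED_LIMIT=${ASSUMED_LIMIT:-1200}
mtu_works() {
    [ "$1" -le "$ASSUMED_LIMIT" ]
}

# Binary search for the largest working MTU in [lo, hi], assuming that
# lo works and that everything below a working value works as well.
bisect_mtu() {
    lo=$1
    hi=$2
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))   # round up so the loop terminates
        if mtu_works "$mid"; then
            lo=$mid
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo "$lo"
}

bisect_mtu 700 1492
```

With roughly 800 candidate values this needs about ten probes, which is
a lot more bearable than stepping through them one by one.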

>> However, why did this work before,
>
> I don't know. Can you make one of your machines fail again by
> downgrading (perhaps just openconnect and openssl) to what you had
> before? Even if you just install the old openconnect and relevant
> libraries in a chroot? It would be very interesting to compare.

I'll try. I don't need the old one so much, but it will certainly take a
couple of hours that I don't have right now (but will on the weekend). I
also have to chat with our networking guys because it could also be that
they screwed something up on their end.

But at least, thanks a lot, this got us a big step further.

>> and - probably more important - how can I make it work with the network manager?
>
> Simple answer:
>   mv /usr/sbin/openconnect /usr/sbin/openconnect.real
>   cat > /usr/sbin/openconnect <<'EOF'
>   #!/bin/sh
>   # Wrapper: run the real binary with a fixed MTU, passing all arguments on.
>   exec "$0.real" --mtu=700 "$@"
>   EOF
>   chmod +x /usr/sbin/openconnect

I believe I tried something as stupid as this... Must have missed 
something very obvious because this did not do as it should.
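
One possible reason for a wrapper like this doing nothing useful: if the
here-document delimiter is not quoted, the shell expands $0 and "$@"
while *writing* the file, so the wrapper ends up with the name of the
creating shell baked in and no arguments passed on. Whether that is what
happened here is only a guess. A small sketch to see the difference,
using throwaway temporary files:

```shell
#!/bin/sh
# Write the same wrapper body twice: once with an unquoted delimiter
# (EOF) and once with a quoted one ('EOF').
tmpdir=$(mktemp -d)

cat > "$tmpdir/unquoted" <<EOF
#!/bin/sh
exec $0.real --mtu=700 "$@"
EOF

cat > "$tmpdir/quoted" <<'EOF'
#!/bin/sh
exec "$0.real" --mtu=700 "$@"
EOF

# Keep the contents for inspection, then clean up.
unquoted=$(cat "$tmpdir/unquoted")
quoted=$(cat "$tmpdir/quoted")
rm -rf "$tmpdir"

# The unquoted variant has $0 and "$@" already substituted by the shell
# that created the file; the quoted variant keeps them literal.
printf '%s\n' "--- unquoted ---" "$unquoted" "--- quoted ---" "$quoted"
```

Only the quoted variant still contains the literal $0 and "$@" that the
wrapper needs at run time.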

> Ideally, we should add an option to the NetworkManager config properly,
> for this and --no-dtls. And/or maybe just an arbitrary 'passthrough' of
> manually-entered options to add to the openconnect command line.

Which would, indeed, be very helpful.

>> Neither do I. Did openconnect possibly adjust the MTU dynamically in
>> earlier releases?
>
> No. ISTR your "upgraded" machines are still on the pathetically ancient
> 3.20 release; I dread to think what you were using before that. Perhaps
> it was *so* old that it didn't do DTLS at all?

No, it did DTLS. I remember that because the version I got from Debian
squeeze was so old that it did not work at all, so I had to build my
own, pre-3.20 at that time. Which then worked for a while, until I
upgraded to wheezy. I noticed with the old Debian version already that
DTLS did not work, but back then it was an issue with OpenSSL requiring
a patch for the broken Cisco implementation. The setup then worked.

The version back then read: openconnect-1c15b37. OpenSSL should have
been 0.9.8, but likely a Debian-patched version and not a vanilla 0.9.8.

> Newer versions will attempt to do some slightly more complex MTU
> negotiation with the server (if the server is new enough to support
> that), but it's all a bit hit-and-miss. The Cisco scheme for doing this
> is very unreliable, and made even more so because we're not sure we
> fully understand it. I'd certainly suggest trying 5.01 and seeing what
> happens with it.
>
> When you're on the university network and things are working, do you
> *still* see these weirdly fragmented packets?

Not tried yet because I haven't had the time. My best guess would be that
there is no issue. I hope to find time to do all that.

> Alternatively, if you're at home and you do something else to trigger
> incoming UDP packets from elsewhere, are *they* weirdly fragmented and
> sometimes missing?

What would you suggest, i.e. what else that uses UDP could I try?

I'll see what I can achieve with the folks in the networking
department - thanks again.

Greetings,
	Thomas