tunnel upstream is unusably slow

Ortwin Glück odi at odi.ch
Wed Apr 2 11:58:28 EDT 2014



On 02.04.2014 16:41, David Woodhouse wrote:
> On Wed, 2014-04-02 at 15:46 +0200, Ortwin Glück wrote:
>> $ strace -ttfp 29785
> 
> After manually editing that to undo the word-wrapping and make it
> readable again, it looks like the two large delays you see in select()
> are when it's waiting for data from the tunnel. I'm not sure it can make
> that happen faster. The problem here isn't the way the event loop works,
> AFAICT.

Right, the problem is the way packets come in through the tun device
via select()/read().
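
For anyone following along, the loop in question has roughly this shape
(a minimal sketch of the select/read pattern, not openconnect's actual
code):

/* Minimal sketch of a select/read pump over a tun fd and an SSL
 * socket fd -- illustrative only, not openconnect's mainloop. */
#include <sys/select.h>
#include <unistd.h>

void pump(int tun_fd, int ssl_fd)
{
    char buf[2048];
    int maxfd = (tun_fd > ssl_fd) ? tun_fd : ssl_fd;

    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(tun_fd, &rfds);
        FD_SET(ssl_fd, &rfds);

        /* Blocks until the kernel marks one of the fds readable.
         * The >200ms stalls in the strace are spent here, waiting
         * for the tun fd. */
        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0)
            break;

        if (FD_ISSET(tun_fd, &rfds)) {
            /* One read() returns one packet queued on the tun device;
             * it would then be encrypted and written to ssl_fd. */
            ssize_t len = read(tun_fd, buf, sizeof(buf));
            if (len <= 0)
                break;
        }
        /* (handling of ssl_fd omitted for brevity) */
    }
}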

I have just noticed that this is a regression in kernel 3.14! Works fine
in 3.12 and 3.13.

Sorry to bother you. I will bisect and speak to kernel folks.

The rest of the answer follows, for completeness.

> 
>> Question remains: why does openconnect's select block for so long? A lot
>> of data should be already available in the socket buffer.
> 
> It's not the socket buffer we care about, from the openconnect point of
> view. It's packets being "sent" on the tunnel interface. And openconnect
> is sending those as soon as they get here...

Yes, OK, poor wording on my part. Call it the send queue.

>> Here you go. I have replaced the actual IPs with names for convenience
>> and privacy. Timestamps are seconds since previous packet. Note the
>> 0.214s delays before local starts sending again after last ACK.
> 
> Hm, I was hoping for timestamps which would correlate with the ones in
> the strace run, but I can cope with this...

These were not run at the same time anyway.

>> No.     Time     Source                Destination           Protocol    Length Info
>>    1100 0.214    tun-local             tun-remote            SSHv2    1280   Encrypted request packet len=1228
>>    1101 0.000    SSL-local             SSL-remote            TLSv1    1126   Application Data
>>    1102 0.000    SSL-local             SSL-remote            TLSv1    328    Application Data
> 
> A packet is 'sent' over the tun interface. OpenConnect immediately sends
> it. It takes *two* packets to send it, since your MTU inside the tunnel
> is larger than your MTU outside the tunnel. 

No matter what value I pass to --mtu, I always end up with two packets;
only their sizes change. What do you suggest? Anyhow, that is certainly
not the reason for the performance issue at hand.
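
To illustrate: the number of outer packets is just arithmetic. A rough
sketch with guessed overhead and MSS values (I have not checked the
exact CSTP/TLS framing):

/* Back-of-the-envelope segment count for one tunnelled packet.
 * The constants are illustrative guesses, not the real framing. */
#include <stdio.h>

int main(void)
{
    int inner     = 1280;  /* packet arriving on the tun device   */
    int overhead  = 100;   /* assumed CSTP + TLS record overhead  */
    int outer_mss = 1300;  /* assumed MSS on the physical path    */

    int total    = inner + overhead;
    int segments = (total + outer_mss - 1) / outer_mss;  /* ceiling */

    printf("%d bytes -> %d TCP segment(s)\n", total, segments);
    /* As long as inner + overhead > outer_mss, lowering --mtu only
     * shrinks the second segment; you stay at two packets until the
     * whole frame fits into a single MSS. */
    return 0;
}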

> Why aren't you using DTLS?

Apr  2 16:47:08 localhost openconnect[3195]: DTLS handshake failed: 1
Apr  2 16:47:08 localhost openconnect[3195]:
140213524236008:error:14092073:SSL routines:SSL3_GET_SERVER_HELLO:bad packet length:s3_clnt.c:1063:

Reason: I am using OpenSSL (the required GnuTLS version is too high
for the rest of the system). Switching to GnuTLS solves that, but the
performance is the same (the select() behaviour is unaffected).

>>    1107 0.214    tun-local             tun-remote            SSHv2    1280   Encrypted request packet len=1228
> 
> There is a 214ms delay before the next packet is sent by SSH (or more to
> the point the TCP stack). The sequence repeats...

Yes, that's what I am saying.

>>    1108 0.000    SSL-local             SSL-remote            TLSv1    1126   Application Data
>>    1109 0.000    SSL-local             SSL-remote            TLSv1    328    Application Data
>>    1110 0.007    SSL-remote            SSL-local             TCP      60     443 > 54547 [ACK] Seq=19851 Ack=205687 Win=32768 Len=0
>>    1111 0.001    SSL-remote            SSL-local             TLSv1    147    Application Data
>>    1112 0.000    SSL-local             SSL-remote            TCP      54     54547 > 443 [ACK] Seq=205961 Ack=19944 Win=65535 Len=0
>>    1113 0.000    tun-remote            tun-local             TCP      52     22 > 37376 [ACK] Seq=6189 Ack=186064 Win=64128 Len=0
>>    1114 0.214    tun-local             tun-remote            SSHv2    1280   Encrypted request packet len=1228
> 
> Why is it taking 7ms for a pair of TCP packets to reach your VPN server
> and then for an ACK to come back for them?

Interesting question. Ping to the tunnel endpoint is around 0.06 ms,
i.e. about 100 times less.

> I can't see the sequence numbers for your outgoing packets but I have
> strong suspicion that there's a fair amount of data in-flight at any
> given time.

No, there are only ever around 4000 bytes in flight (looking at the SSL
connection), and only ever 1354 bytes (one packet!) when looking at the
tunnel interface. That's what I am saying: it doesn't saturate the
bandwidth at all! You can see this from the window sizes in the ACKs:
32768 from SSL-remote and 64128 from tun-remote (higher?!?). It's a
complete ping/pong of ACK/data/ACK/data without ever filling the TCP
window. It's as if it's waiting for something.

> So it's not that the SSH connection sends a data packet,
> gets an ACK within 7ms and then gratuitously waits around for almost
> quarter of a second before sending any more data. I suspect it's got
> packets outstanding and the TCP stack has determined that this is the
> maximum rate it can send at.
> 
> What's the TCP sendq on the offending SSH socket (and the SSL socket for
> that matter, but I suspect that'll be low) when this happens?
>  (netstat -tn)

SSH connection:
tcp        0  29788 tun-local:46577  tun-remote:22   ESTABLISHED
SSL VPN connection:
tcp        0      0 SSL-local:55422  SSL-remote:443  ESTABLISHED

Again, this is what I have been saying: the SSH socket's sendq is
always full (and the VPN socket's is completely empty, of course). But
select() still takes >200ms to get a signal...
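
In case it helps, the same sendq figure can also be sampled from inside
a process via the Linux-specific SIOCOUTQ ioctl, e.g.:

/* Print a socket's unsent send-queue depth, like the netstat Send-Q
 * column. SIOCOUTQ is Linux-specific. */
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/sockios.h>

int print_sendq(int sock_fd)
{
    int unsent = 0;

    if (ioctl(sock_fd, SIOCOUTQ, &unsent) < 0) {
        perror("ioctl(SIOCOUTQ)");
        return -1;
    }
    printf("sendq: %d bytes\n", unsent);
    return 0;
}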



