tunnel upstream is unusably slow

David Woodhouse dwmw2 at infradead.org
Wed Apr 2 10:41:24 EDT 2014


On Wed, 2014-04-02 at 15:46 +0200, Ortwin Glück wrote:
> $ strace -ttfp 29785

After manually editing that to undo the word-wrapping and make it
readable again, it looks like the two large delays you see in select()
are when it's waiting for data from the tunnel. I'm not sure openconnect
can make that happen any faster. The problem here isn't the way the
event loop works, AFAICT.
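
For illustration, here's a *very* simplified sketch of that kind of
select()-based loop. It is not the actual openconnect mainloop: the
helper names are made up, and keepalives, CSTP framing, SSL_pending()
handling etc. are all left out.

    #include <sys/select.h>
    #include <openssl/ssl.h>

    /* Hypothetical helpers, sketched further down in this mail. */
    int forward_tun_to_ssl(int tun_fd, SSL *ssl);
    int forward_ssl_to_tun(SSL *ssl, int tun_fd);

    void mainloop(int tun_fd, SSL *ssl)
    {
        int ssl_fd = SSL_get_fd(ssl);

        for (;;) {
            fd_set rfds;
            int maxfd = (tun_fd > ssl_fd) ? tun_fd : ssl_fd;

            FD_ZERO(&rfds);
            FD_SET(tun_fd, &rfds);  /* packets "sent" on the tun interface */
            FD_SET(ssl_fd, &rfds);  /* data coming back from the VPN server */

            /* The long pauses in your strace are here: there is nothing
             * to wake up for until one side actually produces data.
             * (Real code also passes a keepalive timeout rather than
             * NULL, and checks SSL_pending().) */
            if (select(maxfd + 1, &rfds, NULL, NULL, NULL) <= 0)
                continue;

            if (FD_ISSET(tun_fd, &rfds))
                forward_tun_to_ssl(tun_fd, ssl);
            if (FD_ISSET(ssl_fd, &rfds))
                forward_ssl_to_tun(ssl, tun_fd);
        }
    }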


> Question remains: why does openconnect's select block for so long? A lot
> of data should be already available in the socket buffer.

It's not the socket buffer we care about, from the openconnect point of
view. It's packets being "sent" on the tunnel interface. And openconnect
is sending those as soon as they get here...
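
Sketched in the same spirit (again, not the real code; CSTP framing and
proper error/EAGAIN handling are omitted), the tun->SSL direction is
basically just:

    #include <unistd.h>
    #include <openssl/ssl.h>

    int forward_tun_to_ssl(int tun_fd, SSL *ssl)
    {
        unsigned char pkt[2048];
        ssize_t len = read(tun_fd, pkt, sizeof(pkt)); /* one IP packet */

        if (len <= 0)
            return -1;

        /* No queueing or pacing on our side: the packet goes straight
         * out over the TLS connection as soon as the kernel hands it
         * to us. */
        return SSL_write(ssl, pkt, (int)len);
    }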

> Here you go. I have replaced the actual IPs with names for convenience
> and privacy. Timestamps are seconds since previous packet. Note the
> 0.214s delays before local starts sending again after last ACK.

Hm, I was hoping for timestamps which would correlate with the ones in
the strace run, but I can cope with this...

> No.     Time     Source                Destination           Protocol    Length Info
>    1100 0.214    tun-local             tun-remote            SSHv2    1280   Encrypted request packet len=1228
>    1101 0.000    SSL-local             SSL-remote            TLSv1    1126   Application Data
>    1102 0.000    SSL-local             SSL-remote            TLSv1    328    Application Data

A packet is 'sent' over the tun interface. OpenConnect immediately sends
it. It takes *two* packets to send it, since your MTU inside the tunnel
is larger than your MTU outside the tunnel. That's probably not going to
help your performance. And neither is the fact that you appear not to be
using DTLS. Nobody who cares about performance should be using
TCP-over-TCP. Why aren't you using DTLS?
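
Rough arithmetic, with the caveat that the exact per-record payload
depends on TLS and CSTP overhead so these are estimates: each 1280-byte
packet on the tun interface has to fit into outer TLS records that can
only carry something on the order of 1000-1100 bytes of tunnel payload
each, so every inner packet gets split in two; that's the 1126 + 328
byte pair you see on the SSL connection for every single packet.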

>    1103 0.007    SSL-remote            SSL-local             TCP      60     443 > 54547 [ACK] Seq=19758 Ack=204341 Win=32768 Len=0

Fairly quickly, an ACK comes back from the VPN server for that data.

>    1104 0.001    SSL-remote            SSL-local             TLSv1    147    Application Data
>    1105 0.000    SSL-local             SSL-remote            TCP      54     54547 > 443 [ACK] Seq=204615 Ack=19851 Win=65535 Len=0

1ms later, a data packet comes back over the SSL connection, and an ACK
is sent for it.

>    1106 0.000    tun-remote            tun-local             TCP      52     22 > 37376 [ACK] Seq=6189 Ack=184836 Win=64128 Len=0

We feed the data packet (which is itself an ACK for the SSH connection)
to the tun device with no noticeable delay.
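
And the other direction, sketched the same way (illustrative only, not
the actual code, with CSTP framing again omitted):

    #include <unistd.h>
    #include <openssl/ssl.h>

    int forward_ssl_to_tun(SSL *ssl, int tun_fd)
    {
        unsigned char pkt[2048];
        int len = SSL_read(ssl, pkt, sizeof(pkt));

        if (len <= 0)
            return -1;

        /* Again no delay on our side: the decrypted packet (here, the
         * SSH ACK) is written to the tun device straight away. */
        return (int)write(tun_fd, pkt, (size_t)len);
    }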

>    1107 0.214    tun-local             tun-remote            SSHv2    1280   Encrypted request packet len=1228

There is a 214ms delay before the next packet is sent by SSH (or, more
to the point, by the TCP stack). The sequence repeats...
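
Back-of-the-envelope, if this excerpt is representative: one 1228-byte
SSH packet every ~0.222s (the 214ms gap plus the ~8ms round trip) works
out to roughly 1228 / 0.222 ≈ 5.5KB/s upstream, which would certainly
count as unusably slow.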

>    1108 0.000    SSL-local             SSL-remote            TLSv1    1126   Application Data
>    1109 0.000    SSL-local             SSL-remote            TLSv1    328    Application Data
>    1110 0.007    SSL-remote            SSL-local             TCP      60     443 > 54547 [ACK] Seq=19851 Ack=205687 Win=32768 Len=0
>    1111 0.001    SSL-remote            SSL-local             TLSv1    147    Application Data
>    1112 0.000    SSL-local             SSL-remote            TCP      54     54547 > 443 [ACK] Seq=205961 Ack=19944 Win=65535 Len=0
>    1113 0.000    tun-remote            tun-local             TCP      52     22 > 37376 [ACK] Seq=6189 Ack=186064 Win=64128 Len=0
>    1114 0.214    tun-local             tun-remote            SSHv2    1280   Encrypted request packet len=1228

Why is it taking 7ms for a pair of TCP packets to reach your VPN server
and then for an ACK to come back for them?

I can't see the sequence numbers for your outgoing packets, but I have
a strong suspicion that there's a fair amount of data in flight at any
given time. So it's not that the SSH connection sends a data packet,
gets an ACK within 7ms and then gratuitously waits around for almost a
quarter of a second before sending any more data. I suspect it has
packets outstanding, and the TCP stack has determined that this is the
maximum rate it can send at.

What's the TCP sendq on the offending SSH socket (and the SSL socket for
that matter, but I suspect that'll be low) when this happens? You can
see it in the output of 'netstat -tn'.
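
For reference, the Send-Q column is the one to look at. A made-up
example of the sort of output I mean (values invented, addresses named
as in your anonymised trace):

    Proto Recv-Q Send-Q Local Address          Foreign Address        State
    tcp        0  45568 tun-local:37376        tun-remote:22          ESTABLISHED
    tcp        0    274 SSL-local:54547        SSL-remote:443         ESTABLISHED

A large Send-Q on the :22 socket would tell us the TCP stack has plenty
of data queued and is throttling itself, rather than ssh being starved
of data to send.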



-- 
dwmw2