Dead Peer Detection

Thu May 12 19:33:20 EDT 2011

On Thu, 2011-05-12 at 22:12 +0200, jonathan_jung at web.de wrote:
> Thanks for the help and sorry for the strange/bad english!

The English is perfectly sufficient, but please could you post plain
text email only, not HTML? I accepted the message after it was trapped
for moderation, but now I'm regretting it as I try to make sense of your
output :)

> Connected tun0 as 141.89.47.28, using SSL
> No work to do; sleeping for 20000 ms...
> DTLS handshake timed out
> DTLS handshake failed: 2
> Send CSTP Keepalive
> No work to do; sleeping for 10000 ms...
> Send CSTP DPD
> No work to do; sleeping for 15000 ms...
> Send CSTP DPD
> No work to do; sleeping for 15000 ms...
> Send CSTP DPD
> No work to do; sleeping for 15000 ms...
> CSTP Dead Peer Detection detected dead peer!

The 'dead peer detection' is just a simple ping/pong. We send a DPD
request to the server, and we expect it to reply so that we know we're
still connected.

Except we're *not* still connected. We never get the reply.
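
To make the ping/pong idea concrete, here is a minimal sketch of such a detector in Python. It is not OpenConnect's actual code: the class name, the method names, and the `interval` and `dead_after` parameters are all assumptions made for illustration, and the real client's timing rules may well differ.

```python
class DeadPeerDetector:
    """Hypothetical ping/pong dead-peer detector (illustration only)."""

    def __init__(self, interval, dead_after):
        self.interval = interval      # seconds between DPD requests (assumed)
        self.dead_after = dead_after  # silence, in seconds, before giving up (assumed)
        self.last_rx = 0.0            # time anything last arrived from the peer
        self.last_probe = 0.0         # time the last DPD request was sent

    def packet_received(self, now):
        # Any reply from the server proves the connection is still alive.
        self.last_rx = now

    def tick(self, now):
        # Called from the main loop; says what to do at time `now`.
        if now - self.last_rx >= self.dead_after:
            return "dead"
        if now - self.last_probe >= self.interval:
            self.last_probe = now
            return "send_dpd"
        return "idle"


dpd = DeadPeerDetector(interval=15, dead_after=60)
for t in (15, 30, 45, 60):
    print(t, dpd.tick(t))
```

With no replies at all, the sketch sends a few requests and then reports a dead peer, which is the same shape as the log above.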

> Failed to reconnect to host wlanvpn.uni-potsdam.de
> sleep 10s, remaining timeout 300s

... and we don't even manage to reconnect.

Your tcpdump seems to start just before the dead peer message, and shows
that we keep sending SYN packets to the server to make a new connection,
but they never get a response.

> Nr. 2 - sbin/route -n before the dead peer detection happens:
> /sbin/route -n
> Kernel IP routing table
> Destination Gateway Genmask Flags Metric Ref Use Iface
> 172.16.3.251 172.16.3.254 255.255.255.255 UGH 0 0 0 wlan0
> 141.89.46.0 141.89.46.106 255.255.255.0 UG 0 0 0 tun0
> 172.16.0.0 0.0.0.0 255.255.252.0 U 0 0 0 wlan0
> 0.0.0.0 141.89.46.106 0.0.0.0 UG 0 0 0 tun0

Ick, that's unreadable. After undoing the HTML damage, it looks like
this:

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
172.16.3.251    172.16.3.254    255.255.255.255 UGH   0      0        0 wlan0
141.89.46.0     141.89.46.106   255.255.255.0   UG    0      0        0 tun0
172.16.0.0      0.0.0.0         255.255.252.0   U     0      0        0 wlan0
0.0.0.0         141.89.46.106   0.0.0.0         UG    0      0        0 tun0

This looks suspicious.

Your local wireless network is 172.16.0.0/22, so every host from
172.16.0.0 through to 172.16.3.255 is directly on that network segment?

Your VPN server is 172.16.3.251, so shouldn't it be *directly* on the
same network as you? But you have a route to the VPN server *through*
the gateway at 172.16.3.254?

That might explain why your packets aren't reaching the server.
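
The subnet arithmetic can be checked with Python's standard `ipaddress` module. This is only a sanity check on the addresses in the table above; the variable names are made up for the example.

```python
import ipaddress

# Genmask 255.255.252.0 on 172.16.0.0 is a /22.
wlan = ipaddress.ip_network("172.16.0.0/255.255.252.0")
server = ipaddress.ip_address("172.16.3.251")
gateway = ipaddress.ip_address("172.16.3.254")

print(wlan.prefixlen)      # prefix length of the wireless network
print(wlan[0], wlan[-1])   # first and last address in the range
print(server in wlan)      # is the VPN server on the local segment?
print(gateway in wlan)     # is the gateway on the local segment?
```

Both the VPN server and the gateway fall inside the wireless network's range, so a host route that sends traffic for the server via the gateway is at best redundant.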

What does your routing table look like after you join the wireless
network, but *before* you try to connect to the VPN?

We have special logic in the vpnc-script to preserve your route to the
VPN server, but route everything *else* through the VPN tunnel. I bet
that isn't very well-tested with a VPN server that is actually *on* your
local subnet, and that's why you end up with this strange routing.

If you just run '/sbin/route del 172.16.3.251' as soon as the VPN is
connected, and before the dead peer detection times out, that should
delete the rogue route to the VPN server via that gateway, and let you
route to the VPN server normally. I suspect that'll fix it?
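
The expected effect of deleting that route can be modelled with a simple longest-prefix-match lookup over the table as posted. This is a simplified sketch: the `lookup` helper is hypothetical, and the real kernel also takes metrics and other state into account.

```python
import ipaddress

# Routing table as posted: (destination, gateway or None if on-link, interface)
ROUTES = [
    ("172.16.3.251/32", "172.16.3.254", "wlan0"),
    ("141.89.46.0/24", "141.89.46.106", "tun0"),
    ("172.16.0.0/22", None, "wlan0"),
    ("0.0.0.0/0", "141.89.46.106", "tun0"),
]


def lookup(dest, routes):
    """Return the matching route entry with the longest prefix."""
    addr = ipaddress.ip_address(dest)
    matches = [r for r in routes if addr in ipaddress.ip_network(r[0])]
    return max(matches, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)


# With the host route present, traffic to the server goes via the gateway.
before = lookup("172.16.3.251", ROUTES)

# Model '/sbin/route del 172.16.3.251' by dropping the /32 entry.
without_host_route = [r for r in ROUTES if r[0] != "172.16.3.251/32"]
after = lookup("172.16.3.251", without_host_route)

print(before)
print(after)
```

In this model the lookup falls back to the on-link 172.16.0.0/22 route on wlan0 once the host route is gone, rather than to the default route through tun0.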

I'll try connecting to my own VPN server through a *proxy* on the local
network, which should trigger the same kind of routing situation for
vpnc-script to screw up, and see if I can reproduce and fix it properly.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse at intel.com                              Intel Corporation