Supplicant Recovery on Roam Failures

Thu Dec 27 13:43:36 PST 2012

On Thu, Dec 27, 2012 at 04:18:26PM -0500, Matt Causey wrote:
> I have a question that hopefully you can help me with.  We run
> wpa_supplicant 1.x and Linux 2.6.39 in a somewhat dense Cisco-based WLAN.
> There are lots of clients and they spend a lot of time roaming between
> access points.  Intermittently, the client gets into a state where
> wpa_supplicant has received an RSSI_LOW event, and has selected a BSSID
> that has better coverage.  The client gets to
> ieee80211_send_probe_req<http://lxr.free-electrons.com/ident?i=ieee80211_send_probe_req>()
> [1], and the BSSID that it has selected fails to respond to broadcast
> probes:

Could you please test this with the current 2.0-devel snapshot of
wpa_supplicant (i.e., snapshot of the master branch in
git://w1.fi/srv/git/hostap.git)? There have been a number of changes
improving reaction to cases where authentication/association fails for
whatever reason (very much including the cases where the AP network
tries to use some proprietary load balancing mechanism that is likely
the case here).

> In the 802.11 capture data, we see clearly that the client is sending valid
> Probe Requests, and the access point is failing to respond, though the same
> access point is responding to other clients.  We have the vendor engaged to
> answer the question of why the access point is not responding.

I'd assume you have either load balancing or band steering enabled on
the Cisco APs and that makes them not reply to some Probe Request frames.
In this particular case the new AP seemed to be on the 5 GHz band, so
I'd assume this was not caused by band steering, but load balancing is
likely to have similar effects. I cannot say that I like the way these
APs try to force load balancing, but well, that's what's out there. It
is especially harmful with the mac80211 mechanism of using Probe Request
frames to probe the specific AP and it would have been nice if that
mechanism had been used only for the broadcast probe case, but I
guess it could apply here, too.

> Obviously, the access point is in the client's scan list and is not dead,
> and should be responding.  While discussions with the vendor are underway
> though, I have a question about the way that wpa_supplicant responds to
> this scenario.

If you want a quick way to determine whether the load balancing
mechanism is behind this issue, you could run a test with the WLAN
configuration on the Cisco setup disabling "Client Load Balancing" and
"Client Band Select".

> What appears to happen after the MLME layer times out is that
> the supplicant then kicks off a scan and disconnects.  In our dense
> environment, this scan takes seconds to complete, and breaks our clients.
> In the case where these failures happen, the client already has a very full
> scan list that contains a lot of very solid roam choices.  Is there some
> way that I can cause the supplicant to attempt the next BSS on the list
> rather than invoking a whole new scan?

This is the part that has changed a lot in the latest wpa_supplicant and
as such, I'd like to see a debug log from such a run. wpa_supplicant
should use the temporary blacklist mechanism to allow other BSSes to be
tried and there are now some cases where the "extra" scan can be avoided
or at least limited to a subset of channels (though, in your particular
case, even that may end up being a rather large set of channels). There is
now a better framework in place for allowing old scan results to be used,
so it should be much easier to extend these mechanisms based on the new
debug log if needed.
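
As a rough illustration of the idea (this is not the actual
wpa_supplicant code, and every name below is hypothetical), a temporary
blacklist combined with cached scan results could look something like
the following sketch. It assumes candidates are ranked by signal level
only, that blacklist entries expire after a fixed hold time, and that
the fallback scan is restricted to frequencies where the SSID was
already seen; the real selection logic may well differ.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Bss:
    """One entry from a cached scan result list (hypothetical shape)."""
    bssid: str
    ssid: str
    freq_mhz: int
    signal_dbm: int


@dataclass
class TempBlacklist:
    """Hypothetical temporary blacklist mapping BSSID to expiry time."""
    hold_seconds: float = 10.0
    _expiry: dict = field(default_factory=dict)

    def add(self, bssid, now=None):
        # Record a failed attempt; the BSS is skipped until the hold expires.
        now = time.monotonic() if now is None else now
        self._expiry[bssid] = now + self.hold_seconds

    def contains(self, bssid, now=None):
        now = time.monotonic() if now is None else now
        expiry = self._expiry.get(bssid)
        if expiry is None:
            return False
        if now >= expiry:
            # Hold time is over, so the BSS may be tried again.
            del self._expiry[bssid]
            return False
        return True


def next_candidate(cached_results, ssid, blacklist, now=None):
    """Return the strongest non-blacklisted BSS for ssid, or None."""
    candidates = [
        bss for bss in cached_results
        if bss.ssid == ssid and not blacklist.contains(bss.bssid, now)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda bss: bss.signal_dbm)


def fallback_scan_freqs(cached_results, ssid):
    """Frequencies where ssid was seen, to limit a follow-up scan."""
    return sorted({bss.freq_mhz for bss in cached_results if bss.ssid == ssid})
```

With this sketch, a roam attempt that fails would call
blacklist.add() on the failed BSSID and then next_candidate() again on
the same cached results. Only when that returns None would a new scan
be needed, and fallback_scan_freqs() could then keep that scan to the
channels already known to carry the network.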

-- 
Jouni Malinen                                            PGP id EFC895FA