Issues w/ WDS links in Bridge -- crashes 0.0.4

Wed Aug 6 20:59:20 PDT 2003

Hi All,

We have seen hostap fail when WDS links are used in bridge mode.  With 10 WDS
links active, adding an 11th drives ping times and latency
up to the point where the system is unusable.  I have verified this with
hostap 0.0.4.

We have verified this with two separate systems w/ different hardware.

It has to do with adding the WDS device to the bridge.  You can add any number
of links manually through
iwpriv wlan0 wds_add 00:xx:xx:xx:xx:xx and the bridge stays up fine.  Only
once you actually add these devices to the bridge group and activate
them does the problem appear.  It is always on the 11th WDS device that the
bridge breaks down: latency goes from 3 ms to 500+ ms, with packet
loss.  However, while the card that we add the 11th WDS link to stops
working in the bridge, the other card sitting in the same bridge
still works fine.  That makes me believe this is a hostap issue, because the
bridge keeps functioning on the other card.  If I run iwpriv wlan0
wds_del mac
and remove the device from the bridge, things return to normal.

I am not sure at this point whether this is a bridge issue (probably not,
since the bridge code hasn't changed in 1.5 years) or a hostap issue.

I have been trying to narrow the problem down to a smaller portion of the
code and even looked through the bridge code, but found nothing in the
spanning tree code that treats 10 devices as a limit or buffer size.
According to the bridging standard (IEEE 802.1D, which also defines the
spanning tree protocol), a bridge should be able to have 256 devices or ports
attached w/o a problem.  The same holds for the bridge-utils package.

If anyone else could help shed light on this I would appreciate it.  I
have not been able to reproduce these exact results with hostap 0.0.3, which
makes me believe the problem was introduced in hostap 0.0.4.  I will do some
more testing with 0.0.3 and see if I can reproduce the results.

Brock Eastman

----- Original Message -----
From: "Jouni Malinen" <jkmaline at cc.hut.fi>
To: <hostap at shmoo.com>
Sent: Tuesday, August 05, 2003 8:12 PM
Subject: Workaround for some SMP stability issues - request for testing

> Thanks to all the testing Michael Vallaly did with a couple of patches to
> Host AP driver, I think I now have quite a bit more information about
> the stability issues with hostap_pci on SMP systems. It looks like there
> are two separate issues. 1) Something is corrupting the information
> transferred between card and driver which then results in frequent wlan
> hw resets. 2) hw_reset with hostap_pci on SMP system seems to hang the
> system completely in some cases (completely enough to even prevent NMI
> watchdog from detecting hang).
>
> It looks like the major cause for frequent resets was in fid register
> (RX, TX, TX Error, AllocFid) getting corrupted. I do not have any
> explanation for this apart from hardware/firmware bug. It looks like
> consecutive reads of the fid registers give different results even
> though that register should not change before the event is acknowledged.
>
> I added code that will try its best to make sure that the fid register
> read will return correct fid number. This workaround will read the
> register three times and will return the value only if at least two of
> the reads return the same value. This will be repeated up to five
> times.
>
> The workaround seemed to eliminate more or less all corrupted fid values
> in the test system. Consequently, there was no need to reset the
> hardware and no host system hangs.
>
> I added the workaround and test code into CVS and I would like to ask
> people with SMP systems to test the CVS snapshot version and report what
> they see in kernel log ('dmesg' output) and whether they see any system
> hangs or in general changes to the previous versions of the driver.
>
> In addition to the fid read workaround, I also changed TX fid array
> handling to use heavier locking. Previously used locking may have been
> insufficient on SMP systems. This change alone was also able to reduce
> the number of hardware resets, but this may have been due to changed
> timing (added latency due to heavier locking). Anyway, fid reads were
> still producing corrupted results.
>
> So far, I have mostly concentrated on finding out what is causing the
> resets. I will next try to figure out what could be done about the
> hw_reset and its relation to complete host system lockup.
>
> --
> Jouni Malinen                                            PGP id EFC895FA
>
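
For anyone following along, the fid read workaround described in the quoted
message (read the register three times, accept a value when at least two of
the reads agree, retry up to five times) could look roughly like the sketch
below.  This is NOT the actual Host AP driver code.  The names
read_fid_majority, reg_read_fn, FID_VOTE_ROUNDS and the fake_reg helper are
made up for illustration, and the register access is abstracted behind a
callback so the logic can be tried without hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register accessor: returns the current 16-bit value. */
typedef uint16_t (*reg_read_fn)(void *ctx);

/* "Repeated up to five times" in the description above. */
#define FID_VOTE_ROUNDS 5

/*
 * Read the register three times per round and accept a value as soon as
 * at least two of the three reads agree.  Returns true and stores the
 * agreed value in *fid, or returns false if no round reached a majority.
 */
static bool read_fid_majority(reg_read_fn read_reg, void *ctx, uint16_t *fid)
{
    assert(read_reg != NULL && fid != NULL);

    for (int round = 0; round < FID_VOTE_ROUNDS; round++) {
        uint16_t a = read_reg(ctx);
        uint16_t b = read_reg(ctx);
        uint16_t c = read_reg(ctx);

        if (a == b || a == c) {
            *fid = a;
            return true;
        }
        if (b == c) {
            *fid = b;
            return true;
        }
    }
    return false; /* caller decides what to do, e.g. log and reset */
}

/* Simulated register (illustration only): replays a fixed list of reads. */
struct fake_reg {
    const uint16_t *values;
    size_t count;
    size_t next;
};

static uint16_t fake_reg_read(void *ctx)
{
    struct fake_reg *r = ctx;
    uint16_t v = r->values[r->next % r->count];

    r->next++;
    return v;
}
```

With a fake_reg that replays 0x0123, 0xFFFF, 0x0123, the first round already
has two matching reads, so the function reports 0x0123.  If every read
differs, it gives up after five rounds (15 reads) and returns false.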