[RFCv2 0/3] mac80211: implement fq codel

Mon Mar 21 04:57:04 PDT 2016

On 17 March 2016 at 18:00, Dave Taht <dave.taht at gmail.com> wrote:
> On Thu, Mar 17, 2016 at 1:55 AM, Michal Kazior <michal.kazior at tieto.com> wrote:
>
>> I suspect the BK/BE latency difference has to do with the fact that
>> there's bulk traffic going on BE queues (this isn't reflected
>> explicitly in the plots). The `bursts` flent test includes short
>> bursts of traffic on tid0 (BE) which is shared with ICMP and BE UDP_RR
>> (seen as green and blue lines on the plot). Due to (intended) limited
>> outflow (6mbps) BE queues build up and don't drain for the duration of
>> the entire test creating more opportunities for aggregating BE traffic
>> while other queues are near-empty and very short (time wise as well).
>
> I agree with your explanation. Access to the media and queue length
> are the two variables at play here.
>
> I just committed a new flent test that should exercise the vo,vi,be,
> and bk queues, "bursts_11e". I dropped the conventional ping from it
> and just rely on netperf's udp_rr for each queue. It seems to "do the
> right thing" on the ath9k....
[...]
> I long for regular "rrul" and "rrul_be" tests against the new stuff to
> blow it up thoroughly as references along the way.
> (tcp_upload, tcp_download, (and several of the rtt_fair tests also
> between stations)). Will get formal about it here as soon as we end up
> on the same kernel trees....
[...]
> simple example of the damage having all 4 queues always contending is
> exemplified by running the rrul and rrul_be tests against nearly any
> given AP.

Thanks! I've run more tests and am attaching results.

A couple of words on the test naming:
 - "fast" means 1x1 station with good RF conditions
 - "slow" means 1x1 station with bad RF conditions (antenna unplugged)
 - "fast+slow" means traffic is directed to both "fast" and "slow" stations
 - "verfast" means 4x4 station for peak tput measurement
 - "autorate" means rate control is enabled
 - "rate6m" means 6mbps fixed tx rate on DUT
 - the DUT is acting as AP in all tests
 - other devices in the setup *do not* have any extra patches (so
bidirectional tests must be carefully analyzed)
 - 4 sets of software patches:
   - fullpatch contains all codel patches (mac80211+ath10k)
   - macpatch contains only mac80211 changes (so ath10k at least gets
to use per-txq fq-codel like queuing)
   - pre-waketx is ath10k with some patches reverted (before
pull-push/wake-tx-queue stuff was applied)
   - waketx is current ath10k (i.e. with simple wake_tx_queue implementation)

Observations/notes:
 - the "slow" case proves my naive get_expected_throughput() for ath10k is
highly inaccurate because it does not consider retries. Because of that,
latency gets bad as mac80211's tx scheduling queues up more than
necessary; ath9k should do a lot better with minstrel
 - I kept netperf 2.6 (which has no udp-rr recovery) for now as it's
easier to spot glitches

Please let me know if you see anything interesting or worrying in these plots.

>> I've modified traffic-gen and re-run tests with bursts on all tested
>> tids/ACs (tid0, tid1, tid5). I'm attaching the results.
>>
>> With bursts on all tids you can clearly see BK has much higher latency than BE.
>
> The long term goal here, of course, is for BK (or the other queues) to
> not have seconds of queuing latency but something more bounded to 2x
> media access time...

My patch already tries to maintain a txop-based in-flight tx queue
depth. Current defaults are to keep 3-4 txops per hardware and
roughly 2 txops per tid. You could argue these are too big but I wanted
to keep them conservative, at least initially, to make sure not to
affect peak throughput badly. All of these are knobs you can play with
via debugfs.
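
To make the txop-based depth idea concrete, here is a rough sketch (not the
patch's actual code) of how an expected throughput and a number of txops
could translate into a byte budget for in-flight frames. The function name
and the 4000 us txop duration are assumptions picked for illustration only;
the 2 and 4 txop counts come from the defaults mentioned above.

```python
def txop_budget_bytes(expected_tput_bps, txop_us, n_txops):
    """Bytes that occupy n_txops of airtime at the expected throughput.

    Hypothetical helper for illustration; integer math avoids rounding noise.
    """
    return expected_tput_bps * txop_us * n_txops // (8 * 1_000_000)


# Assumed numbers: 6 Mbps expected throughput, 4000 us per txop.
per_tid_budget = txop_budget_bytes(6_000_000, 4000, 2)  # ~2 txops per tid
per_hw_budget = txop_budget_bytes(6_000_000, 4000, 4)   # upper end of 3-4 txops per hardware

print(per_tid_budget, per_hw_budget)
```

The budget scales linearly with the throughput estimate, so an estimate
that is too optimistic directly inflates how much gets queued.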

This requires drivers to use ieee80211_tx_schedule(). If a driver merely
uses wake_tx_queue it will only benefit from flow fairness (albeit
limited) but it will not keep queues at the N-txop fill level (unless
the driver does that on its own).

This means that Tim's ath9k patch will need to be adjusted a bit to
make use of this new API prototype for full effect. Unfortunately I
haven't had time to work on this front yet.

>> (Note, I've changed my AP to QCA988X with oldie firmware 10.1.467 for
>> this test; it doesn't have the weird hiccups I was seeing on QCA99X0
>> and newer QCA988X firmware reports bogus expected throughput which is
>> most likely a result of my sloppy proof-of-concept change in ath10k).
>
> So I should avoid ben greer's firmware for now?

I'm guessing his 10.1 fork should work fine. Not sure about the 10.2.4 though.

Anyway, keep in mind you'll get mixed results with ath10k. The
throughput estimation I've done for now is an ugly hack. It works in
fixed-rate conditions (which I use to prove a point that given
adequate rate estimation you can keep fw/hw tx queues at a reasonable
latency). However, it doesn't consider tx retries or unstable RF conditions
(rate control is in firmware and there's limited information available
to the driver), which leads to more frames being queued than
necessary (and therefore to increased latency). This becomes apparent
with real-life interference and tx retries (just compare
"autorate,slow" against "rate6m,fast").

ath9k should do a much better job at this (although that requires Tim's
patches; I haven't tested that myself) because it uses minstrel, which
should predict throughput a lot more reliably.
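
As a back-of-the-envelope illustration of why ignoring retries hurts latency,
the sketch below sizes a queue for the nominal rate and then drains it at a
lower effective rate. All names and numbers are assumptions made up for the
example (6000 bytes queued, 6 Mbps nominal, an average of 3 transmit attempts
per frame); they are not measurements from the attached plots, and dividing
by the attempt count is a deliberately crude model of retries.

```python
def effective_tput_bps(nominal_bps, avg_tx_attempts):
    """Crude model: each frame costs avg_tx_attempts transmissions."""
    return nominal_bps / avg_tx_attempts


def queue_latency_ms(queued_bytes, tput_bps):
    """Time needed to drain queued_bytes at the given throughput."""
    return queued_bytes * 8 * 1000 / tput_bps


queued = 6000  # bytes queued based on the nominal (retry-free) estimate

ideal_ms = queue_latency_ms(queued, 6_000_000)
actual_ms = queue_latency_ms(queued, effective_tput_bps(6_000_000, 3))

print(ideal_ms, actual_ms)
```

Under these assumptions the same queue takes three times longer to drain
than the estimate predicted, which is the kind of gap a retry-aware rate
estimate would be expected to close.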

Michał
-------------- next part --------------
A non-text attachment was scrubbed...
Name: flent-2016-03-21.tar.gz
Type: application/x-gzip
Size: 2029295 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/ath10k/attachments/20160321/daa8d1c5/attachment-0001.bin>