[PATCH net-next v1 02/13] net: wwan: tmi: Add buffer management

Wed Jan 11 05:37:10 PST 2023

Hello Sergey,

sorry for the late response, please check following reply.

On Sat, 2022-12-17 at 00:17 +0400, Sergey Ryazanov wrote:
> Hello Yanchao,
> 
> sorry for late response, please find some thoughts below.
> 
> On 09.12.2022 14:26, Yanchao Yang (杨彦超) wrote:
> > On Sun, 2022-12-04 at 22:58 +0400, Sergey Ryazanov wrote:
> > > On 22.11.2022 15:11, Yanchao Yang wrote:
> > > > From: MediaTek Corporation <linuxwwan at mediatek.com>
> > > > 
> > > > To malloc I/O memory as soon as possible, buffer management
> > > > comes
> > > > into being.
> > > > It creates buffer pools that reserve some buffers through
> > > > deferred
> > > > works when
> > > > the driver isn't busy.
> > > > 
> > > > The buffer management provides unified memory allocation/de-
> > > > allocation
> > > > interfaces for other modules. It supports two buffer types of
> > > > SKB
> > > > and page.
> > > > Two reload work queues with different priority values are
> > > > provided
> > > > to meet
> > > > various requirements of the control plane and the data plane.
> > > > 
> > > > When the reserved buffer count of the pool is less than a
> > > > threshold
> > > > (default
> > > > is 2/3 of the pool size), the reload work will restart to
> > > > allocate
> > > > buffers
> > > > from the OS until the buffer pool becomes full. When the buffer
> > > > pool fills,
> > > > the OS will recycle the buffer freed by the user.
> > > > 
> > > > Signed-off-by: Mingliang Xu <mingliang.xu at mediatek.com>
> > > > Signed-off-by: MediaTek Corporation <linuxwwan at mediatek.com>
> > > > ---
> > > >    drivers/net/wwan/mediatek/Makefile  |   3 +-
> > > >    drivers/net/wwan/mediatek/mtk_bm.c  | 369
> > > > ++++++++++++++++++++++++++++
> > > >    drivers/net/wwan/mediatek/mtk_bm.h  |  79 ++++++
> > > >    drivers/net/wwan/mediatek/mtk_dev.c |  11 +-
> > > >    drivers/net/wwan/mediatek/mtk_dev.h |   1 +
> > > >    5 files changed, 461 insertions(+), 2 deletions(-)
> > > >    create mode 100644 drivers/net/wwan/mediatek/mtk_bm.c
> > > >    create mode 100644 drivers/net/wwan/mediatek/mtk_bm.h
> > > 
> > > Yanchao, can you share some numbers, how this custom pool is
> > > outperform
> > > the regular kernel allocator?
> > 
> > Prepare 2 drivers *.ko for comparison.
> > Driver A (following named A):  enable pre-allocate buffer pool.
> > Driver B (following named A):  disenable pre-allocate buffer pool.
> > It
> > uses kernel API directly (__dev_alloc_skb and netdev_alloc_frag)
> > 
> > Test Instrument: Keysight UXM TA
> > Iperf command:
> > Server Command: iperf3 -s -p 5002 -i 1
> > Client Command: iperf3 -c 192.168.2.1 -p 5002 -i 1 -w 8M -t 30 -R
> > -P 5
> > 
> > Test result: Fig 1. A’s TCP DL throughput Fig 2. B’s TCP DL
> > throughput
> > (Ref attachment)
> > 
> >  From the results, it represents that the A’s IP packets throughput
> > reaches 7Gbits/sec, while B’s throughput is 4.7Gbits/sec. A’s
> > throughput is up about 50% compared with B.
> > 
> > In addition, from ftrace, it represents following results.
> > A: it takes 14.241828s for allocating 33211099 buffers. The average
> > time is about 0.4us.
> > B: it takes 7.677069s for allocating 10890789 buffers. The average
> > time
> > is about 0.7us.
> 
> Thank you for this impressive comparison test. There is something to 
> think about here.
> 
> In a common case, the kernel memory API is fast enough to guarantee 
> multi-gigabit throughput. So if some custom code outperforms it,
> then 
> either (a) you have found some corner case where the kernel memory
> API 
> is deadly slow and should be improved, or (b) there is something
> wrong 
> with a driver code. My point is that a driver should not implement 
> custom memory management since that leads to a driver complexity
> without 
> any real performance improvement.
> 
> The test shows the really significant difference between the custom 
> memory pool and the direct kernel API calling. So let's try to
> figure 
> out what is going on.
> 
> I assume that the control path (CLDMA) could not cause that much 
> performance degradation due to the low control messages traffic. So
> most 
> probably the root cause is somewhere in the data path (DPMAIF).
> Correct 
> me if my assumption is wrong.
> 
> Digging deeper into the driver code, I noticed that there actually
> two 
> types of pools (buffers). One pool type contains ready-made skbs,
> and 
> the other contains just page fragments. And both types of pools are 
> utilized in the data Rx path. Have you tried measuring which type of 
> pool improves performance more significantly?
> 
> I also noticed that neither allocated skb nor allocated page
> fragments 
> are freed in the DPMAIF code. So the improvement is not connected to 
> optimal caching (i.e. memory reuse). Thus memory allocation
> improvement 
> is most likely caused by avoiding of some contention.
> 
> The pool reload is performed in the context of work. And if I am not 
> mistaken, then skbs and fragments are also taken from preallocated
> pools 
> in the context of work to reinitialize the BAT (Rx) ring buffer.
> There 
> is no difference in the matter of priority. Both the pool reload and
> the 
> Rx ring buffer reload functions are called with the same priority on
> an 
> arbitrary CPU in the absence of other high priority tasks (e.g. 
> tasklets, irq). The only obvious difference is the invocation rate.
> The 
> pool reload operation is triggered as soon as the pool level falls
> below 
> the predefined threshold (currently 67%). While the Rx ring reload 
> operation is called on each NAPI poll. Have you considered
> introducing a 
> threshold similar to the pool reload threshold and calling the rx
> ring 
> reload less frequently?
> 
> --
> Sergey
The practice, pre-allocating SKB, is a mechanism that driver maintains
a buffer pool, where all buffers are the same length. The case of
sharing test result uses SKB buffer, (Page is not used in this test
scenario). About your query, both SKB and pages are released by kernel 
TCP/IP protocol stack. BTW, what¡¯s the purpose of reducing the
frequency of reloading BAT? There is more sharing test result.
Prepare 2
drivers *.ko for comparison.
Driver A (following named A):  enable pre-
allocate buffer pool.
Driver B (following named B):  disenable pre-
allocate buffer pool. It uses kernel API directly (__dev_alloc_skb and
netdev_alloc_frag)

The horizontal and vertical coordinates of the
following two graphs represent time and hardware available BAT counts,
respectively. Fig3.png shows A¡¯s available BAT partially enlarged
view. It presents that the trend of change is relatively stable.
Fig4.png shows B¡¯s available BAT partially enlarged view. It presents
that the trend of change is huge. From modem side, it leads to
reordering more data that affects throughput. However, considering the
test proof CPU loading, memory usage, and performance on low-
performance machines are not enough, we plan to remove this mechanism
in initial version and submit it again as a separate patch after the
initial version is accepted and merged.

many thanks.
yanchao.yang
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Fig3.png
Type: image/png
Size: 45226 bytes
Desc: Fig3.png
URL: <http://lists.infradead.org/pipermail/linux-mediatek/attachments/20230111/fe9db771/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Fig4.png
Type: image/png
Size: 93319 bytes
Desc: Fig4.png
URL: <http://lists.infradead.org/pipermail/linux-mediatek/attachments/20230111/fe9db771/attachment-0003.png>