Performance loss with word count benchmark

Sat Oct 14 01:14:49 PDT 2023

On Sat, 14 Oct 2023, Matthew Wilcox wrote:

> On Fri, Oct 13, 2023 at 11:40:41PM +0200, Julia Lawall wrote:
> > I tried some more recent versions, with interesting results.  This is on a
> > four-socket Intel 6130.
> >
> > wordcount_yeti-4_5.19.0_performance.json:      "mean": 3.12326750034,
> > wordcount_yeti-4_6.0.0_performance.json:      "mean": 3.08271866984,
> > wordcount_yeti-4_6.1.0_performance.json:      "mean": 3.9168611305299996,
> > wordcount_yeti-4_6.2.0_performance.json:      "mean": 3.9323236962599992,
> > wordcount_yeti-4_6.3.0_performance.json:      "mean": 3.9299195262349995,
> > wordcount_yeti-4_6.4.0_performance.json:      "mean": 1.6312004133800002,
> > wordcount_yeti-4_6.5.0_performance.json:      "mean": 1.6477082927600002,
> > wordcount_yeti-4_6.6.0rc1_performance.json:      "mean": 0.9028324600200002,
> > wordcount_yeti-4_6.6.0rc3_performance.json:      "mean": 0.8936725624550004,
> >
> > Bisecting between 6.3 and 6.4 gives the following list of commits:
> >
> > c7f8f31c00d1 mm: separate vma->lock from vm_area_struct =========> 1.6 seconds
> > 0d2ebf9c3f78 mm/mmap: free vm_area_struct without call_rcu in exit_mmap
> > 70d4cbc80c88 (HEAD) powerpc/mm: try VMA lock-based page fault handling first
> > cd7f176aea5f arm64/mm: try VMA lock-based page fault handling first
> > 0bff0aaea03e x86/mm: try VMA lock-based page fault handling first
> > 52f238653e45 mm: introduce per-VMA lock statistics ==============> 3.9 seconds
>
> OK, I think I understand what's going on.
>
> The maple tree (like the AVL tree) is slower to modify than the RB tree.
> So when an mprotect() is running, it takes the mmap_lock for write and
> prevents all faults from being satisfied.  Because one thread is holding
> off all the others for longer, performance goes down.
>
> The commits you've found avoid taking the mmap_lock for read in order
> to handle a page fault.  So it (mostly) doesn't matter that the maple
> tree takes longer to modify.
>
> > There is another big performance improvement in 6.6-rc1.  After
> > c7f8f31c00d187a2c71a241c7f2bd6aa102a4e6f, the graph still has the towers
> > of mprotects, as shown in the attached graph.  In 6.6.0-rc3, the towers
> > are gone, and the mprotects come at different times on different cores.
> > This surely reduces contention and improves performance.  But I haven't
> > bisected that case to see what causes the changed behavior.
>
> We drizzled changes in to increase the number of cases we could handle
> without the mmap_lock.  So you're probably seeing the results of some
> of those changes.
>
> You might be interested in testing the range between
> 29a22b9e08d7 and 350f6bbca1de .  Perhaps start with the parent of
> 350f6bbca1de and walk forward (although there may be build issues ...
> I think Andrew screwed up merging Suren's patchset and my patchset;
> mine was supposed to come after Suren's, but he reordered them).

I tried poking around in that range, but I kept running into compiler
errors.  But the improvement is at least somewhere in that range.
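
For reference, the size of each step in the numbers quoted above can be
checked with a small script along the following lines.  This is only a
sketch: the means are copied from the results above and rounded to three
decimals, and the helper names and the 10% threshold are arbitrary choices
made for illustration, not part of the benchmark tooling.

```python
# Mean run times in seconds, copied from the results quoted above
# (rounded to three decimals; ordering follows the kernel versions).
means = {
    "5.19.0": 3.123,
    "6.0.0": 3.083,
    "6.1.0": 3.917,
    "6.2.0": 3.932,
    "6.3.0": 3.930,
    "6.4.0": 1.631,
    "6.5.0": 1.648,
    "6.6.0rc1": 0.903,
    "6.6.0rc3": 0.894,
}


def relative_changes(results):
    """Return (previous, current, percent change) for consecutive versions."""
    versions = list(results)
    changes = []
    for prev, cur in zip(versions, versions[1:]):
        pct = (results[cur] - results[prev]) / results[prev] * 100.0
        changes.append((prev, cur, pct))
    return changes


def big_steps(results, threshold=10.0):
    """Keep only the steps whose absolute change is at least `threshold` percent.

    The 10% default is an arbitrary cutoff chosen for this illustration.
    """
    return [(p, c, pct) for p, c, pct in relative_changes(results)
            if abs(pct) >= threshold]


if __name__ == "__main__":
    for prev, cur, pct in big_steps(means):
        print(f"{prev} -> {cur}: {pct:+.1f}%")
```

With these inputs, only three steps exceed the threshold: the slowdown
from 6.0 to 6.1, and the improvements from 6.3 to 6.4 and from 6.5 to
6.6-rc1.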

julia