Performance loss with word count benchmark

Matthew Wilcox willy at infradead.org
Fri Oct 13 20:51:53 PDT 2023


On Fri, Oct 13, 2023 at 11:40:41PM +0200, Julia Lawall wrote:
> I tried some more recent versions, with interesting results.  This is on a
> four-socket Intel 6130.
> 
> wordcount_yeti-4_5.19.0_performance.json:      "mean": 3.12326750034,
> wordcount_yeti-4_6.0.0_performance.json:      "mean": 3.08271866984,
> wordcount_yeti-4_6.1.0_performance.json:      "mean": 3.9168611305299996,
> wordcount_yeti-4_6.2.0_performance.json:      "mean": 3.9323236962599992,
> wordcount_yeti-4_6.3.0_performance.json:      "mean": 3.9299195262349995,
> wordcount_yeti-4_6.4.0_performance.json:      "mean": 1.6312004133800002,
> wordcount_yeti-4_6.5.0_performance.json:      "mean": 1.6477082927600002,
> wordcount_yeti-4_6.6.0rc1_performance.json:      "mean": 0.9028324600200002,
> wordcount_yeti-4_6.6.0rc3_performance.json:      "mean": 0.8936725624550004,
> 
> Bisecting between 6.3 and 6.4 gives the following list of commits:
> 
> c7f8f31c00d1 mm: separate vma->lock from vm_area_struct =========> 1.6 seconds
> 0d2ebf9c3f78 mm/mmap: free vm_area_struct without call_rcu in exit_mmap
> 70d4cbc80c88 (HEAD) powerpc/mm: try VMA lock-based page fault handling first
> cd7f176aea5f arm64/mm: try VMA lock-based page fault handling first
> 0bff0aaea03e x86/mm: try VMA lock-based page fault handling first
> 52f238653e45 mm: introduce per-VMA lock statistics ==============> 3.9 seconds

OK, I think I understand what's going on.

The maple tree (like the AVL tree) is slower to modify than the RB tree.
So when an mprotect() is running, it takes the mmap_lock for write and
prevents all faults from being satisfied.  Because one thread is holding
off all the others for longer, performance goes down.

The commits you've found avoid taking the mmap_lock for read in order
to handle a page fault.  So it (mostly) doesn't matter that the maple
tree takes longer to modify.
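
As a rough illustration of the "try the VMA lock first" idea, here is a
toy model in Python.  It is only a sketch: Vma, AddressSpace and
handle_fault are made-up names, not kernel functions, and a plain mutex
stands in for the kernel's per-VMA read/write lock.  The point is just
that a fault in one VMA need not wait for a writer that is busy with a
different VMA.

```python
import threading


class Vma:
    """Toy stand-in for a vm_area_struct that carries its own lock."""

    def __init__(self, start, end):
        self.start, self.end = start, end
        # Simplification: the kernel's per-VMA lock is a read/write lock.
        self.lock = threading.Lock()


class AddressSpace:
    """Toy stand-in for an mm_struct: a list of VMAs plus one mmap_lock."""

    def __init__(self, vmas):
        self.vmas = vmas
        self.mmap_lock = threading.Lock()

    def find_vma(self, addr):
        for vma in self.vmas:
            if vma.start <= addr < vma.end:
                return vma
        return None

    def handle_fault(self, addr):
        """Return the name of the lock that ended up covering the fault."""
        vma = self.find_vma(addr)
        # Fast path: try the per-VMA lock without blocking.
        if vma is not None and vma.lock.acquire(blocking=False):
            try:
                return "vma_lock"
            finally:
                vma.lock.release()
        # Slow path: fall back to mmap_lock, which waits for any writer.
        with self.mmap_lock:
            return "mmap_lock"
```

In this model, a thread that holds mmap_lock and the lock of one VMA
(as an mprotect() would) does not hold off a fault in another VMA; only
faults that miss the fast path end up waiting on mmap_lock.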

> There is another big performance improvement in 6.6-rc1.  After
> c7f8f31c00d187a2c71a241c7f2bd6aa102a4e6f, the graph still has the towers
> of mprotects, as shown in the attached graph.  In 6.6.0-rc3, the towers
> are gone, and the mprotects come at different times on different cores.
> This surely reduces contention and improves performance.  But I haven't
> bisected that case to see what causes the changed behavior.

We drizzled changes in to increase the number of cases we could handle
without the mmap_lock.  So you're probably seeing the results of some
of those changes.

You might be interested in testing the range between
29a22b9e08d7 and 350f6bbca1de.  Perhaps start with the parent of
350f6bbca1de and walk forward (although there may be build issues ...
I think Andrew screwed up merging Suren's patchset and my patchset;
mine was supposed to come after Suren's, but he reordered them).
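
If it helps, a small helper along these lines could build the git
command that lists the commits to test, oldest first.  This is a sketch
under an assumption: that 350f6bbca1de is the older end of the range,
as "start with the parent ... and walk forward" suggests.  If the two
commits are ordered the other way round, the list comes back empty and
the arguments need swapping.  rev_list_cmd is a made-up name.

```python
def rev_list_cmd(older, newer):
    """Build a git command listing commits from `older` to `newer`, oldest first.

    `older^..newer` excludes the parent of `older`, which serves as the
    baseline to test before walking forward through the list.
    """
    return ["git", "rev-list", "--reverse", f"{older}^..{newer}"]


# Assumed ordering; swap the arguments if the output is empty.
cmd = rev_list_cmd("350f6bbca1de", "29a22b9e08d7")
```

The resulting list can be handed to subprocess.run() inside a kernel
checkout, and each commit built and benchmarked in turn.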



