msync in pmfs and other NVM filesystems

Matthew Wilcox willy at
Fri Feb 7 18:14:57 EST 2014

On Fri, Feb 07, 2014 at 01:59:18PM -0800, Dave Hansen wrote:
> On 02/07/2014 01:43 PM, Byan, Steve wrote:
> > I admit to having not read the code, but I'm hopeful that someone readily knows how, for a msync call, the kernel identifies the correct execute-in-place filesystem(s) to dispatch to for a range of memory. Suppose the application calls msync with a range that specifies more than one mapped file, and those two files are in different XIP filesystems?
> The code is in mm/msync.c.  It's fairly straightforward.
> Essentially, it walks the virtual addresses handed to the system call,
> finds all the parts that are file-backed, and calls vfs_fsync() on the
> _whole_ file.
> FWIW, it doesn't look that efficient or optimized, but it seems to do
> the trick.

Fixing this code is on my todo list.

For a start, it assumes that all the world uses the page cache, and that's
not true for XIP files.  So for MS_ASYNC, it is a rather expensive no-op
(that walks the VMAs in order to detect if the range covers some invalid
addresses).

For MS_SYNC, it calls vfs_fsync(filp, 0), which writes the entire
file back (rather than just the specified range), and writes back
all the metadata, which isn't required.  It should instead be calling
vfs_fsync_range(filp, fstart, fend, 1) [1].

Next, we need to look at dirty bit tracking.  Linux used to (until 2.6.19)
walk the page tables to find all the dirty bits.  Now it reflects PTE
dirty bits into the struct page dirty bits elsewhere and relies on them
being up to date to know what to sync.  We don't have struct pages, so
we need to track which pages are dirty (in the CPU cache) and need to
be written back.  We don't want to have the VM scanning PM pages since
they don't need to be written back like DRAM pages do.  So we need a
common place to store the dirty bit that is the logical OR of all the
PTEs that map it (and we need to take care of clearing it once the page
is written back).
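
As a rough illustration of the bookkeeping described above, here is a
userspace sketch (not kernel code) of a shared dirty map that is the logical
OR of the dirty bits seen in every mapping's PTEs, with bits cleared as pages
are written back. Every name in it (pm_dirty_map, pm_fold_pte_dirty,
pm_writeback_range, SKETCH_NPAGES) is hypothetical, and it assumes a file of
at most 64 pages and no concurrent writers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NPAGES 64	/* assumption: the file fits in one 64-bit word */

/* Hypothetical per-file dirty map: bit i set means file page i is dirty. */
struct pm_dirty_map {
	uint64_t bits;
};

/*
 * Fold the dirty bits harvested from one mapping's PTEs into the shared map.
 * pte_dirty has bit i set if that mapping's PTE for file page i was dirty.
 */
void pm_fold_pte_dirty(struct pm_dirty_map *map, uint64_t pte_dirty)
{
	map->bits |= pte_dirty;
}

/*
 * Write back every dirty page in [first, last] (inclusive) and clear its bit
 * afterwards.  flush_page stands in for whatever flushes the CPU cache lines
 * of a page; it may be NULL.  Returns the number of pages written back.
 * Single-threaded sketch: races with new stores are ignored.
 */
size_t pm_writeback_range(struct pm_dirty_map *map, unsigned first,
			  unsigned last, void (*flush_page)(unsigned page))
{
	size_t flushed = 0;

	assert(first <= last && last < SKETCH_NPAGES);
	for (unsigned page = first; page <= last; page++) {
		uint64_t bit = UINT64_C(1) << page;

		if (!(map->bits & bit))
			continue;
		if (flush_page)
			flush_page(page);
		map->bits &= ~bit;
		flushed++;
	}
	return flushed;
}
```

If one mapping dirtied pages 0 and 2 and another dirtied pages 1 and 2, the
map holds pages 0-2 after both are folded in, and a writeback of pages 1-2
leaves only page 0 marked dirty.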

There are probably a few other nuances that I would remember if I didn't
have a stuffy head right now.

[1] Calculating fstart and fend isn't entirely trivial.  Also, we would
need to fall back to vfs_fsync(filp, 1) when VM_NONLINEAR is set.
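
One plausible way to derive fstart and fend for a single linear mapping is
sketched below in userspace C. The struct is a hypothetical stand-in for the
three vm_area_struct fields involved, the 4 KiB page size is an assumption,
and fend is treated as an inclusive byte offset. VM_NONLINEAR mappings are
not handled, since the arithmetic only holds when file offsets increase
linearly across the VMA.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the vm_area_struct fields used here. */
struct vma_sketch {
	uint64_t vm_start;	/* first virtual address of the mapping */
	uint64_t vm_end;	/* first address past the end of the mapping */
	uint64_t vm_pgoff;	/* file offset of vm_start, in pages */
};

struct file_range {
	uint64_t fstart;	/* first file byte to sync */
	uint64_t fend;		/* last file byte to sync, inclusive */
};

/*
 * Translate the part of the msync range [start, end) that overlaps this VMA
 * into a byte range within the backing file.
 */
struct file_range msync_file_range(const struct vma_sketch *vma,
				   uint64_t start, uint64_t end)
{
	struct file_range r;
	/* Clamp the requested range to the VMA. */
	uint64_t lo = start > vma->vm_start ? start : vma->vm_start;
	uint64_t hi = end < vma->vm_end ? end : vma->vm_end;

	assert(lo < hi);	/* caller must pass an overlapping range */
	r.fstart = (lo - vma->vm_start) + (vma->vm_pgoff << SKETCH_PAGE_SHIFT);
	r.fend = r.fstart + (hi - lo) - 1;
	return r;
}
```

For example, with a VMA covering 0x10000-0x14000 at vm_pgoff 3, an msync of
0x11000-0x20000 is clamped to the VMA and maps to file bytes 0x4000 through
0x6fff. A caller would repeat this for each VMA the range touches.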
