[RFC] Reinstate NFS exportability for JFFS2.

Neil Brown neilb at suse.de
Sun Aug 3 21:03:27 EDT 2008


On Sunday August 3, chuck.lever at oracle.com wrote:
> On Sun, Aug 3, 2008 at 7:56 AM, Neil Brown <neilb at suse.de> wrote:
> > On Saturday August 2, bfields at fieldses.org wrote:
> >>
> >> Though really I can't see any great objection to just moving xfs's hack
> >> up into nfsd.  It may not do everything, but it seems like an
> >> incremental improvement.
> >
> > Because it is a hack, and hacks have a tendency to hide deeper
> > problems, and to never get cleaned up, and generally to become a
> > burden to future generations.
> 
> Agreed that maintainability is an important concern.
> 
> However, I don't see that what David suggests in general is hiding a
> deeper problem, but is rather exposing it.  Can you explain what you
> think is the issue?

The locking bothers me.
The VFS seems to have a general policy of doing the locking itself to
make life easier for filesystem implementors (or to make it harder for
them to mess up, depending on your perspective).

The current issue seems to suggest that the locking provided by the
VFS is no longer adequate, so each filesystem is needing to create
something itself.  That suggests to me a weakness in the model.
Possibly the VFS should give up trying to be in control and let
filesystems do their own locking.  Possibly there are still things
that the VFS can do which are universally good.  I think these are
questions that should be addressed.  
Maybe they have already been addressed and I missed the conversation
(that wouldn't surprise me much).  But seeing words like "hack"
suggests to me that it hasn't.  So I want to make sure I understand
the problem properly and deeply before giving my blessing to a hack.

So: what are the issues?

 Obviously readdir can race with create and you don't want them
 tripping each other up.  The current VFS approach to this is i_mutex.
 Any place which modifies a directory or does a lookup in a directory
 takes i_mutex to ensure that the directory doesn't change.

 This is probably fairly limiting.  With a tree-structured directory
 you only really need to lock the 'current node' of the tree.
 So any access would lock the top node, find which child to follow,
 then lock the child and unlock the parent.  Rebalancing might need to
 be creative as you cannot lock a parent while holding a lock on the
 child, but that isn't insurmountable.
 So I could argue that holding i_mutex across a lookup or create or
 readdir maybe isn't ideal.  It would be interesting to explore the
 possibility of dropping i_mutex across calls into the filesystem.
 By the time the filesystem is called, we really only need to be
 locking the names (dentries) rather than the whole directory.
 More on this later...
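
 As an aside, a minimal sketch of that lock-coupling idea is below.
 The names in it (struct dir_node, find_child) are invented purely for
 illustration; this is not real VFS or filesystem code, just the shape
 of "lock the child, then unlock the parent":

    struct dir_node {
            struct mutex lock;
            /* ... keys, children, directory entries ... */
    };

    /* Walk down the tree holding only the lock on the current node. */
    static struct dir_node *walk(struct dir_node *root, const char *name)
    {
            struct dir_node *node = root, *child;

            mutex_lock(&node->lock);
            while ((child = find_child(node, name)) != NULL) {
                    /* hand-over-hand: lock the child, then drop the parent */
                    mutex_lock(&child->lock);
                    mutex_unlock(&node->lock);
                    node = child;
            }
            /* returns with node->lock held on the final node */
            return node;
    }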

 Some filesystems want to restructure directories at times that are
 independent of any particular access.  This might be defragmentation
 or rebalancing or whatever.  Clearly there needs to be mutual
 exclusion between this and other accesses such as readdir and lookup.
 The VFS clearly cannot help with this.  It doesn't know when
 rebalancing might happen or at what sort of granularity.  So the
 filesystem must do its own locking.
 It strikes me that this sort of locking would best be done by
 spinlocks at the block level rather than a per-inode mutex.  However
 I haven't actually implemented this (yet) so I cannot be certain.

 This is what is causing the current problem (for JFFS2 at least).
 JFFS2 has a per-inode lock which protects against internally visible
 changes to the inode.  Further (and this is key) this lock is held
 across the filldir callback.
 i_mutex is also held across the filldir callback, but there is an
 important difference.  It is taken by the VFS, not the filesystem,
 and it is guaranteed always to be held across the filldir callback.
 So the filldir callback can call e.g. lookup without further locking.

 Backing up a little:  given that the filesystem implementor chose to
 use per-inode locking to protect internal restructuring (which is
 certainly an easier option), why not just use i_mutex?  The reason
 is that a create operation might trigger system-wide garbage
 collection which could trigger restructuring of the current inode,
 which would lead to an A-A deadlock (as the create is waiting for the
 garbage collection, and so still holding i_mutex).
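
 To make that deadlock concrete, here is a hypothetical sketch.  The
 function names are invented (this is not the actual JFFS2 call chain);
 it only shows the shape of the problem if i_mutex were reused as the
 internal restructuring lock:

    /* The VFS takes dir->i_mutex before calling into the filesystem. */
    static int fs_create(struct inode *dir, struct dentry *dentry, int mode)
    {
            if (fs_needs_space(dir->i_sb))          /* invented */
                    fs_garbage_collect(dir->i_sb);  /* may restructure 'dir' */
            return fs_do_create(dir, dentry, mode); /* invented */
    }

    static void fs_restructure(struct inode *dir)
    {
            mutex_lock(&dir->i_mutex);      /* A-A deadlock when reached from
                                             * fs_create(): already held there */
            /* ... rewrite the directory's entries ... */
            mutex_unlock(&dir->i_mutex);
    }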

So, given that background, it is possible to see some further
solutions (other than the ones already mentioned).

 - drop the internal lock across filldir.
   It could be seen as impolite to hold any locks across a callback
   that are not documented as being held.
   This would put an extra burden on the filesystem, but it shouldn't
   be a particularly big burden.
   A filesystem needs to be able to find the 'next' entry from a
   sequential 'seek' pointer so that is the most state that needs to
   be preserved.  It might be convenient to be able to keep more state
   (pointers into data structures etc).  All this can be achieved with
   fairly standard practice:
     a/ have a 'version' number per inode which is updated when any
        internal restructure happens.
      b/ when calling filldir, record the version number, drop the lock,
         call filldir, reclaim the lock, and check the version number
      c/ if the version number matches (as it mostly will) just keep
         going.  If it doesn't, jump back to the top of readdir where
         we decode the current seek address.  (A rough sketch of this
         pattern follows the list below.)

   Some filesystems have problems implementing seekdir/telldir so they
   might not be totally happy here.  I have little sympathy for such
   filesystems and feel the burden should be on them to make it work.

 - use i_mutex to protect internal changes too, and drop i_mutex while
   doing internal restructuring.   This would need some VFS changes so
   that dropping i_mutex would be safe.  It would require some way to
   lock an individual dentry.  Probably you would lock it under
   i_mutex by setting a flag bit, wait for the flag on some inode-wide
   waitqueue, and drop the lock by clearing the flag and waking the
   waitqueue. And you are never allowed to look at ->d_inode if the
   lock flag is set.
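
Here is the rough sketch of the first option promised above.  Every
name in it (fs_dir, fs_entry_at, and the f->sem/f->version fields) is
invented and merely stands in for a filesystem's internal per-inode
lock and state; it only shows the record-version / drop-lock / filldir
/ retake-lock / check-version dance:

    static int fs_readdir(struct file *filp, void *dirent, filldir_t filldir)
    {
            struct fs_dir *f = filp->private_data;  /* invented per-dir state */
            struct fs_dirent *de;
            char name[NAME_MAX + 1];
            unsigned long version;
            loff_t next;
            ino_t ino;
            int len;

            mutex_lock(&f->sem);
    restart:
            /* decode the current seek address into a cursor */
            de = fs_entry_at(f, filp->f_pos);       /* invented */
            while (de) {
                    /* copy what filldir needs while f->sem is still held */
                    len = de->namelen;
                    memcpy(name, de->name, len);
                    ino = de->ino;
                    next = de->next_pos;
                    version = f->version;

                    mutex_unlock(&f->sem);  /* no internal lock across the callback */
                    if (filldir(dirent, name, len, filp->f_pos, ino, DT_UNKNOWN) < 0)
                            return 0;       /* buffer full; resume here next time */
                    filp->f_pos = next;
                    mutex_lock(&f->sem);

                    if (f->version != version)
                            goto restart;   /* 'de' may be stale: re-decode f_pos */
                    de = de->next;
            }
            mutex_unlock(&f->sem);
            return 0;
    }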

Of these I really like the second.  Refining the i_mutex down to a
per-name lock before calling into the filesystem seems like a really
good idea and should be good for scalability and large directories.
It isn't something to be done lightly though.  The filesystem would
still be given i_mutex held, but would be allowed to drop it if it
wanted to.  We could have an FS_DROPS_I_MUTEX flag similar to the current
FS_RENAME_DOES_D_MOVE.
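
To make that slightly more concrete, here is a very rough sketch of
what the per-name lock might look like.  DCACHE_NAME_LOCKED and
i_name_wait do not exist; they are invented here, and a real
implementation would have to settle details this sketch ignores
(memory barriers, whether waiters keep i_mutex while sleeping,
interaction with the rest of the dcache):

    /* Called with dir->i_mutex held.  On return the name is locked and
     * i_mutex may be dropped; nobody may look at dentry->d_inode while
     * the flag is set. */
    static void lock_name(struct inode *dir, struct dentry *dentry)
    {
            spin_lock(&dentry->d_lock);
            while (dentry->d_flags & DCACHE_NAME_LOCKED) {
                    spin_unlock(&dentry->d_lock);
                    wait_event(dir->i_name_wait,
                               !(dentry->d_flags & DCACHE_NAME_LOCKED));
                    spin_lock(&dentry->d_lock);
            }
            dentry->d_flags |= DCACHE_NAME_LOCKED;
            spin_unlock(&dentry->d_lock);
    }

    static void unlock_name(struct inode *dir, struct dentry *dentry)
    {
            spin_lock(&dentry->d_lock);
            dentry->d_flags &= ~DCACHE_NAME_LOCKED;
            spin_unlock(&dentry->d_lock);
            wake_up(&dir->i_name_wait);
    }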

For the first, I really like the idea that a lock should not be held
across calls to filldir.  I feel that a filesystem doing that is
"wrong" in much the same way that some think that recursing into the
filesystem as nfsd does is "wrong".  In reality neither is "wrong", we
just need to work together and negotiate and work out the best way to
meet all of our needs.

So what should we do now?  I think that for JFFS2 to drop and reclaim
f->sem in readdir would be all of 20 lines of code, including updating
a ->version counter elsewhere in the code.  Replicating that in all
the filesystems that need it would probably be more code than the nfsd
change though.

On the other hand, if we implement the nfsd change, it will almost
certainly never go away, even if all filesystems eventually find that
they don't need it any more because someone improves the locking
rules in the VFS.  Whereas the code in the filesystems could quite
possibly go away when they are revised to make better use of the
locking regime.  So I don't think that is an ideal situation either.

If I had time, I would work on modifying the VFS to allow filesystems
to drop i_mutex.  However I don't have time at the moment, so I'll
leave the decision to be made by someone else  (Hi Bruce!  I'll
support whatever you decide).

But I think it is important to understand what is really going on and
not just implement a hack that works around the problem.  I think I do
understand now, so I am a lot happier.  Hopefully you do too.

NeilBrown


