xfs: open code inc_inode_iversion when logging an inode

Fri Nov 22 17:59:03 EST 2013

Gitweb:     http://git.infradead.org/?p=mtd-2.6.git;a=commit;h=2fe8c1c08b3fbd87dd2641c8f032ff6e965d5803
Commit:     2fe8c1c08b3fbd87dd2641c8f032ff6e965d5803
Parent:     8f80587bacb6eb893df0ee4e35fefa0dfcfdf9f4
Author:     Dave Chinner <dchinner at redhat.com>
AuthorDate: Fri Nov 1 15:27:17 2013 +1100
Committer:  Ben Myers <bpm at sgi.com>
CommitDate: Mon Nov 18 09:42:08 2013 -0600

    xfs: open code inc_inode_iversion when logging an inode
    
    Michael L Semon reported that generic/069 runtime increased on v5
    superblocks by 100% compared to v4 superblocks. his perf-based
    analysis pointed directly at the timestamp updates being done by the
    write path in this workload. The append writers are doing 4-byte
    writes, so there are lots of timestamp updates occurring.
    
    The thing is, they aren't being triggered by timestamp changes -
    they are being triggered by the inode change counter needing to be
    updated. That is, every write(2) system call needs to bump the inode
    version count, and it does that through the timestamp update
    mechanism. Hence for v5 filesystems, test generic/069 is running 3
    orders of magnitude more timestmap update transactions on v5
    filesystems due to the fact it does a huge number of *4 byte*
    write(2) calls.
    
    This isn't a real world scenario we really need to address - anyone
    doing such sequential IO should be using fwrite(3), not write(2).
    i.e. fwrite(3) buffers the writes in userspace to minimise the
    number of write(2) syscalls, and the problem goes away.
    
    However, there is a small change we can make to improve the
    situation - removing the expensive lock operation on the change
    counter update.  All inode version counter changes in XFS occur
    under the ip->i_ilock during a transaction, and therefore we
    don't actually need the spin lock that provides exclusive access to
    it through inc_inode_iversion().
    
    Hence avoid the lock and just open code the increment ourselves when
    logging the inode.
    
    Reported-by: Michael L. Semon <mlsemon35 at gmail.com>
    Signed-off-by: Dave Chinner <dchinner at redhat.com>
    Reviewed-by: Christoph Hellwig <hch at lst.de>
    Signed-off-by: Ben Myers <bpm at sgi.com>
---
 fs/xfs/xfs_trans_inode.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index 1bba7f6..50c3f56 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -111,12 +111,14 @@ xfs_trans_log_inode(
 
 	/*
 	 * First time we log the inode in a transaction, bump the inode change
-	 * counter if it is configured for this to occur.
+	 * counter if it is configured for this to occur. We don't use
+	 * inode_inc_version() because there is no need for extra locking around
+	 * i_version as we already hold the inode locked exclusively for
+	 * metadata modification.
 	 */
 	if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
 	    IS_I_VERSION(VFS_I(ip))) {
-		inode_inc_iversion(VFS_I(ip));
-		ip->i_d.di_changecount = VFS_I(ip)->i_version;
+		ip->i_d.di_changecount = ++VFS_I(ip)->i_version;
 		flags |= XFS_ILOG_CORE;
 	}