[PATCH] vfs: Don't evict inode under the inode lru traversing context

Jan Kara jack at suse.cz
Mon Aug 5 08:30:18 PDT 2024


On Mon 05-08-24 09:34:46, Zhihao Cheng wrote:
> From: Zhihao Cheng <chengzhihao1 at huawei.com>
> 
> The inode reclaiming process(See function prune_icache_sb) collects all
> reclaimable inodes and mark them with I_FREEING flag at first, at that
> time, other processes will be stuck if they try getting these inodes
> (See function find_inode_fast), then the reclaiming process destroy the
> inodes by function dispose_list(). Some filesystems(eg. ext4 with
> ea_inode feature, ubifs with xattr) may do inode lookup in the inode
> evicting callback function, if the inode lookup is operated under the
> inode lru traversing context, deadlock problems may happen.
> 
> Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
>         if ea_inode feature is enabled, the lookup process will be stuck
> 	under the evicting context like this:
> 
>  1. File A has inode i_reg and an ea inode i_ea
>  2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
>  3. Then, following three processes running like this:
> 
>     PA                              PB
>  echo 2 > /proc/sys/vm/drop_caches
>   shrink_slab
>    prune_dcache_sb
>    // i_reg is added into lru, lru->i_ea->i_reg
>    prune_icache_sb
>     list_lru_walk_one
>      inode_lru_isolate
>       i_ea->i_state |= I_FREEING // set inode state
>      inode_lru_isolate
>       __iget(i_reg)
>       spin_unlock(&i_reg->i_lock)
>       spin_unlock(lru_lock)
>                                      rm file A
>                                       i_reg->nlink = 0
>       iput(i_reg) // i_reg->nlink is 0, do evict
>        ext4_evict_inode
>         ext4_xattr_delete_inode
>          ext4_xattr_inode_dec_ref_all
>           ext4_xattr_inode_iget
>            ext4_iget(i_ea->i_ino)
>             iget_locked
>              find_inode_fast
>               __wait_on_freeing_inode(i_ea) ----→ AA deadlock
>     dispose_list // cannot be executed by prune_icache_sb
>      wake_up_bit(&i_ea->i_state)
> 
> Case 2: In deleted inode writing function ubifs_jnl_write_inode(), file
>         deleting process holds BASEHD's wbuf->io_mutex while getting the
> 	xattr inode, which could race with inode reclaiming process(The
>         reclaiming process could try locking BASEHD's wbuf->io_mutex in
> 	inode evicting function), then an ABBA deadlock problem would
> 	happen as following:
> 
>  1. File A has inode ia and a xattr(with inode ixa), regular file B has
>     inode ib and a xattr.
>  2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa
>  3. Then, following three processes running like this:
> 
>         PA                PB                        PC
>                 echo 2 > /proc/sys/vm/drop_caches
>                  shrink_slab
>                   prune_dcache_sb
>                   // ib and ia are added into lru, lru->ixa->ib->ia
>                   prune_icache_sb
>                    list_lru_walk_one
>                     inode_lru_isolate
>                      ixa->i_state |= I_FREEING // set inode state
>                     inode_lru_isolate
>                      __iget(ib)
>                      spin_unlock(&ib->i_lock)
>                      spin_unlock(lru_lock)
>                                                    rm file B
>                                                     ib->nlink = 0
>  rm file A
>   iput(ia)
>    ubifs_evict_inode(ia)
>     ubifs_jnl_delete_inode(ia)
>      ubifs_jnl_write_inode(ia)
>       make_reservation(BASEHD) // Lock wbuf->io_mutex
>       ubifs_iget(ixa->i_ino)
>        iget_locked
>         find_inode_fast
>          __wait_on_freeing_inode(ixa)
>           |          iput(ib) // ib->nlink is 0, do evict
>           |           ubifs_evict_inode
>           |            ubifs_jnl_delete_inode(ib)
>           ↓             ubifs_jnl_write_inode
>      ABBA deadlock ←-----make_reservation(BASEHD)
>                    dispose_list // cannot be executed by prune_icache_sb
>                     wake_up_bit(&ixa->i_state)
> 
> Fix it by forbidding inode evicting under the inode lru traversing
> context. In details, we import a new inode state flag 'I_LRU_ISOLATING'
> to pin inode without holding i_count under the inode lru traversing
> context, the inode evicting process will wait until this flag is
> cleared from i_state.

Thanks for the patch and sorry for not getting to this myself!  Let me
rephrase the above paragraph a bit for better readability:

Fix the possible deadlock by using new inode state flag I_LRU_ISOLATING to
pin the inode in memory while inode_lru_isolate() reclaims its pages
instead of using ordinary inode reference. This way inode deletion cannot
be triggered from inode_lru_isolate() thus avoiding the deadlock. evict()
is made to wait for I_LRU_ISOLATING to be cleared before proceeding with
inode cleanup.

> @@ -488,6 +488,36 @@ static void inode_lru_list_del(struct inode *inode)
>  		this_cpu_dec(nr_unused);
>  }
>  
> +static void inode_lru_isolating(struct inode *inode)

Perhaps call this inode_pin_lru_isolating()

> +{
> +	BUG_ON(inode->i_state & (I_LRU_ISOLATING | I_FREEING | I_WILL_FREE));
> +	inode->i_state |= I_LRU_ISOLATING;
> +}
> +
> +static void inode_lru_finish_isolating(struct inode *inode)

And call this inode_unpin_lru_isolating()?

Otherwise the patch looks good so feel free to add:

Reviewed-by: Jan Kara <jack at suse.cz>

								Honza
-- 
Jan Kara <jack at suse.com>
SUSE Labs, CR



More information about the linux-mtd mailing list