[LSF/MM/BPF TOPIC] : Flexible Data Placement (FDP) availability for kernel space file systems
Dave Chinner
david at fromorbit.com
Wed Jan 17 13:51:37 PST 2024
On Wed, Jan 17, 2024 at 12:58:12PM +0100, Javier González wrote:
> On 16.01.2024 11:39, Viacheslav Dubeyko wrote:
> > > On Jan 15, 2024, at 8:54 PM, Javier González <javier.gonz at samsung.com> wrote:
> > > > How can FDP technology improve the efficiency and reliability of
> > > > kernel-space file systems?
> > >
> > > This is an open problem. Our experience is that making data placement
> > > decisions on the FS is tricky (beyond the obvious data / metadata). If
> > > someone has a good use-case for this, I think it is worth exploring.
> > > F2FS is a good candidate, but I am not sure FDP is of interest for
> > > mobile - here ZUFS seems to be the current dominant technology.
> > >
> >
> > If I understand the FDP technology correctly, I can see the benefits for
> > file systems. :)
> >
> > For example, SSDFS is based on the segment concept and it has multiple
> > types of segments (superblock, mapping table, segment bitmap, b-tree
> > nodes, user data). So, as a first step, I can use hints to place different
> > segment types into different reclaim units.
>
> Yes. This is what I meant by data / metadata. We have also looked into
> using 1 RUH for metadata and making the rest available to applications. We
> decided to go with a simple solution to start with and complete it as we
> see users.
XFS has an abstract type definition for metadata that it uses to
prioritise cache reclaim (i.e. classifies what metadata is more
important/hotter) and that could easily be extended to IO hints
to indicate placement.
We also have a separate journal IO path, and that is probably the
hottest LBA region of the filesystem (the circular overwrite region),
which would stand to have its own classification as well.
We've long talked about making use of write IO hints for separating
these things out, but requiring 10+ IO hint channels just to
robustly classify filesystem metadata has been a show-stopper.
Doing nothing is almost always better than doing placement
hinting poorly.
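
To put a rough number on that, here's an illustrative sketch in plain
C (the class names and the channel mapping are made up for the
example; this is not XFS code) of how many distinct metadata classes
a filesystem like XFS would want kept apart, and how coarsely they
end up collapsed when only a few hint channels are available:

    /* hint_classes.c - illustrative only, not XFS code */
    #include <stdio.h>

    /* Hypothetical metadata classes a filesystem might want separated. */
    enum md_class {
        MD_JOURNAL,        /* circular overwrite, hottest region */
        MD_SUPERBLOCK,
        MD_AG_HEADERS,
        MD_INODE_BTREE,
        MD_ALLOC_BTREE,
        MD_BMAP_BTREE,
        MD_RMAP_BTREE,
        MD_REFCOUNT_BTREE,
        MD_DIR_BLOCKS,
        MD_ATTR_BLOCKS,
        MD_QUOTA,
        MD_CLASS_MAX,      /* already 11 classes before any user data */
    };

    /* Collapse the classes onto a small number of hint channels. */
    static int md_class_to_hint(enum md_class c)
    {
        switch (c) {
        case MD_JOURNAL:
            return 0;      /* hottest: gets its own channel */
        case MD_INODE_BTREE:
        case MD_DIR_BLOCKS:
            return 1;      /* warm metadata */
        default:
            return 2;      /* everything else lumped together */
        }
    }

    int main(void)
    {
        for (int c = 0; c < MD_CLASS_MAX; c++)
            printf("class %d -> hint channel %d\n", c,
                   md_class_to_hint(c));
        return 0;
    }

The lumping in the default case is exactly the "doing it poorly"
outcome: most of the classes lose their distinction before the
hardware ever sees them.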
> > Technically speaking, any file system can place different types of metadata in
> > different reclaim units. However, user data is a slightly more tricky case. Potentially,
> > file system logic can track “hotness” or frequency of updates of some user data
> > and try to direct the different types of user data into different reclaim units.
*cough*
We already do this in the LBA space via the filesystem allocators.
It's often configurable and generally called "allocation policies".
> > But, from another point of view, we have folders in file system namespace.
> > If an application can place different types of data in different folders, then, technically
> > speaking, file system logic can place the content of different folders into different
> > reclaim units. But the application needs to follow some “discipline” to store different
> > types of user data (different “hotness”, for example) in different folders.
Yup, XFS does this "physical locality is determined by parent
directory" separation by default (the inode64 allocation policy).
Every new directory inode is placed in a different allocation group
(LBA space) based on a rotor mechanism. All the files within that
directory are kept local to the directory (i.e. in the same AG/LBA
space) as much as possible.
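
As a rough illustration (a toy sketch, not the actual XFS inode64
allocator; the AG count and function names are made up), the policy
amounts to something like:

    /* ag_rotor.c - illustrative parent-directory locality policy,
     * not the actual XFS inode64 allocator. */
    #include <stdio.h>

    #define AG_COUNT 16            /* assumed number of allocation groups */

    static unsigned int dir_rotor;  /* global rotor for new directories */

    /* New directories get spread across AGs via the rotor. */
    static unsigned int pick_ag_for_new_dir(void)
    {
        return dir_rotor++ % AG_COUNT;
    }

    /* Regular files inherit the AG of their parent directory so that
     * a directory's contents stay physically local. */
    static unsigned int pick_ag_for_new_file(unsigned int parent_dir_ag)
    {
        return parent_dir_ag;
    }

    int main(void)
    {
        for (int i = 0; i < 4; i++) {
            unsigned int dir_ag = pick_ag_for_new_dir();
            printf("dir %d -> AG %u, its files -> AG %u\n",
                   i, dir_ag, pick_ag_for_new_file(dir_ag));
        }
        return 0;
    }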
Most filesystems have LBA locality policies like this because it is
highly efficient on physical seek latency limited storage hardware.
i.e. the storage hardware we've mostly been using since the early
1980s.
We could make allocation groups have different reclaim units,
but then we are talking about needing an arbitrary number of
different IO hints - XFS supports ~2^31 AGs if the filesystem is
large enough, and there's no way we're going to try to support that
many IO hints (software or hardware) in the foreseeable future.
If devices want to try to classify related data themselves, using
LBA locality internally to classify relationships below the level of
IO hints, then that would be a much closer match to how filesystems
have traditionally structured their data and metadata on disk.
Related data and metadata tend to get written to the same LBA
regions because that's the fastest way to access related data and
metadata on seek-limited hardware.
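
Purely to illustrate the idea (this is not any real firmware's logic;
the region size and reclaim unit count are assumptions), a
device-side classifier keyed on LBA locality could be as simple as:

    /* lba_bucket.c - hypothetical device-side classification by LBA
     * locality; purely illustrative, not real firmware logic. */
    #include <stdio.h>
    #include <stdint.h>

    #define REGION_SHIFT     21     /* assume 1GiB regions of 512B LBAs */
    #define NR_RECLAIM_UNITS 8      /* assumed internal placement streams */

    /* Writes landing in the same LBA region get grouped into the same
     * reclaim unit, approximating the filesystem's locality decisions. */
    static unsigned int lba_to_reclaim_unit(uint64_t lba)
    {
        return (lba >> REGION_SHIFT) % NR_RECLAIM_UNITS;
    }

    int main(void)
    {
        uint64_t lbas[] = { 0, 100, 1ULL << 22, 3ULL << 21, 1ULL << 30 };

        for (unsigned int i = 0; i < sizeof(lbas) / sizeof(lbas[0]); i++)
            printf("LBA %llu -> reclaim unit %u\n",
                   (unsigned long long)lbas[i],
                   lba_to_reclaim_unit(lbas[i]));
        return 0;
    }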
Yeah, I know that these are SSDs we are talking about and they
aren't seek limited, but when we already have filesystem
implementations that try to clump related things to nearby LBA
spaces, it might be best to try to leverage this behaviour rather
than try to rely on kernel and userspace to correctly provide hints
about their data patterns.
Cheers,
Dave.
--
Dave Chinner
david at fromorbit.com