[EXT] Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
Dave Chinner
david at fromorbit.com
Wed Nov 13 15:51:09 PST 2024
On Wed, Nov 13, 2024 at 05:47:36AM +0100, Christoph Hellwig wrote:
> On Tue, Nov 12, 2024 at 06:18:21PM +0000, Pierre Labat wrote:
> > About 2)
> > Provide a simple way for the user to decide which layer generates write hints.
> > As an example, as some of you pointed out, what if the filesystem wants to generate write hints to optimize its [own] data handling by the storage, and at the same time the application using the FS understands the storage and also wants to optimize using write hints.
> > Both use cases are legit, I think.
> > To handle that in a simple way, why not have a filesystem mount parameter enabling/disabling the use of write hints by the FS?
>
> The file system is, and always has been, the entity in charge of
> resource allocation of the underlying device. Bypassing it will get
> you in trouble, and a simple mount option isn't really changing that
> (it's also not exactly a scalable interface).
>
> If an application wants to micro-manage placement decisions it should not
> use a file system, or at least not a normal one with Posix semantics.
> That being said we've demonstrated that applications using proper grouping
> of data by file and the simple temperature hints can get very good results
> from file systems that can interpret them, without a lot of work in the
> file system. I suspect for most applications that actually want files
> that is actually going to give better results than trying to do the
> micro-management that tries to bypass the file system.
This.
The most important thing that filesystems do behind the scenes is
manage -data locality-. XFS has thousands of lines of code to manage
and control data locality - the allocation policy API itself has
*dozens* of control parameters. We have 2 separate allocation
architectures (one btree based, one bitmap based) and multiple
locality policy algorithms. These juggle physical alignment, size
granularity, size limits, data type being allocated for, desired
locality targets, different search algorithms (e.g. first fit, best
fit, exact fit by size or location, etc), multiple fallback
strategies when the initial target cannot be met, etc.
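
To make the difference between two of the search algorithms named above concrete, here is a toy sketch of first fit versus best fit over a list of free extents. This is purely illustrative and is not XFS code: the `Extent` tuple, the function names and the flat list of free space are all assumptions made for the example, and a real allocator also has to weigh alignment, locality targets and fallbacks.

```python
# Illustrative only: toy free-space search policies, not XFS code.
from typing import List, Optional, Tuple

# Hypothetical representation: (start_block, length_in_blocks)
Extent = Tuple[int, int]


def first_fit(free: List[Extent], want: int) -> Optional[Extent]:
    """Return space from the lowest-addressed free extent that is big enough."""
    for start, length in sorted(free):
        if length >= want:
            return (start, want)
    return None


def best_fit(free: List[Extent], want: int) -> Optional[Extent]:
    """Return space from the smallest free extent that is big enough.

    Ties on length are broken by the lower start address.
    """
    candidates = [e for e in free if e[1] >= want]
    if not candidates:
        return None
    start, _ = min(candidates, key=lambda e: (e[1], e[0]))
    return (start, want)


free_space = [(0, 8), (100, 4), (200, 16)]
print(first_fit(free_space, 4))  # takes from the extent at block 0
print(best_fit(free_space, 4))   # takes the exact-size extent at block 100
```

First fit is cheap but carves up the large extent at the front; best fit leaves the larger extents intact for later large allocations. Which one is right depends on the policy in force at the time, which is exactly why there are so many control parameters.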
Allocation policy management is the core of every block based
filesystem that has ever been written.
Specifically to this "stream hint" discussion: go look at the XFS
filestreams allocator.
SGI wrote an entirely new allocator for XFS whose only purpose in
life is to automatically separate individual streams of user data
into physically separate regions of LBA space.
This was written to optimise realtime ingest and playback of
multiple uncompressed 4k and 8k video data streams from big
isochronous SAN storage arrays back in ~2005. Each stream could be
up to 1.2GB/s of data. If the data for each IO was not exactly
placed in alignment with the storage array stripe cache granularity
(2MB, IIRC), then a cache miss would occur and the IO latency would
be too high and frames of data would be missed/dropped.
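
The alignment constraint described above comes down to simple arithmetic. A minimal sketch, assuming the granularity is exactly 2 MiB (the text only recalls "2MB, IIRC") and using hypothetical helper names:

```python
# Illustrative only: alignment arithmetic for a stripe cache granularity.
# Assumption: "2MB" means 2 MiB (2 * 1024 * 1024 bytes).
STRIPE_BYTES = 2 * 1024 * 1024


def is_aligned(offset: int, length: int, granularity: int = STRIPE_BYTES) -> bool:
    """True if an IO starts and ends on granularity boundaries."""
    return offset % granularity == 0 and length % granularity == 0


def round_up(value: int, granularity: int = STRIPE_BYTES) -> int:
    """Round value up to the next multiple of granularity."""
    return -(-value // granularity) * granularity


MiB = 1024 * 1024
print(is_aligned(0, 4 * MiB))        # aligned start and length
print(is_aligned(1 * MiB, 2 * MiB))  # start is off by 1 MiB
print(round_up(3 * MiB) // MiB)      # next boundary, in MiB
```

An allocator that wants every IO in a stream to pass the `is_aligned` check has to place the stream's extents on these boundaries up front, which is something only the layer that owns block allocation can guarantee.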
IOWs, we have an allocator in XFS that was specifically designed to
separate independent streams of data into independent regions of the
filesystem LBA space to efficiently support data IO rates in the order
of tens of GB/s.
What are we talking about now? Storage hardware that might be able
to do 10-15GB/s of IO that needs stream separation for efficient
management of the internal storage resources.
The fact we have previously solved this class of stream separation
problem at the filesystem level *without needing a user-controlled
API at all* is probably the most relevant fact missing from this
discussion.
As to the concern about stream/temp/hint translation consistency
across different hardware: the filesystem is the perfect place to
provide this abstraction to users. The block device can expose what
it supports, the user API can be fixed, and the filesystem can
provide the mapping between the two that won't change for the life
of the filesystem...
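
As a rough sketch of what such a mapping could look like: the user-visible hint names stay fixed, the device reports how many streams it supports, and the filesystem computes the translation once and keeps it. Everything here is an assumption made for illustration: the four hint names mirror the temperature levels of the existing write lifetime hints, `build_hint_map` is a hypothetical helper, and stream 0 is used to mean "no stream".

```python
# Illustrative only: a fixed user-facing hint set mapped onto however many
# streams a device exposes. Not an existing kernel or filesystem interface.
from typing import Dict

# Fixed user API: coldest-to-hottest ordering does not matter here, only
# that the set of names never changes.
USER_HINTS = ["short", "medium", "long", "extreme"]


def build_hint_map(device_streams: int) -> Dict[str, int]:
    """Spread the fixed hint set across 1..device_streams.

    Stream 0 means "no stream"; it is used when the device has none.
    """
    if device_streams <= 0:
        return {hint: 0 for hint in USER_HINTS}
    return {
        hint: 1 + (i * device_streams) // len(USER_HINTS)
        for i, hint in enumerate(USER_HINTS)
    }


# Computed once (say, at mkfs time) and stored, so it never changes for
# the life of the filesystem.
print(build_hint_map(2))
print(build_hint_map(8))
```

With two device streams the four hints collapse pairwise onto streams 1 and 2; with eight they spread out across the range. The application sees the same four hints either way.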
Long story short: Christoph is right.
The OS hints/streams API needs to be aligned to the capabilities
that filesystems already provide *as a primary design goal*. What
the new hardware might support is a secondary concern. i.e. hardware
driven software design is almost always a mistake: define the user
API and abstractions first, then the OS can reduce it sanely down to
what the specific hardware present is capable of supporting.
-Dave.
--
Dave Chinner
david at fromorbit.com