[PATCH v7 0/3] FDP and per-io hints

Javier Gonzalez javier.gonz at samsung.com
Thu Oct 10 05:27:33 PDT 2024


On 10.10.2024 12:46, Hans Holmberg wrote:
>On Thu, Oct 10, 2024 at 9:13 AM Javier Gonzalez <javier.gonz at samsung.com> wrote:
>>
>> On 10.10.2024 08:40, Hans Holmberg wrote:
>> >On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <javier.gonz at samsung.com> wrote:
>> >>
>> >>
>> >>
>> >> > -----Original Message-----
>> >> > From: Hans Holmberg <hans at owltronix.com>
>> >> > Sent: Tuesday, October 8, 2024 12:07 PM
>> >> > To: Javier Gonzalez <javier.gonz at samsung.com>
>> >> > Cc: Christoph Hellwig <hch at lst.de>; Jens Axboe <axboe at kernel.dk>; Martin K.
>> >> > Petersen <martin.petersen at oracle.com>; Keith Busch <kbusch at kernel.org>;
>> >> > Kanchan Joshi <joshi.k at samsung.com>; hare at suse.de; sagi at grimberg.me;
>> >> > brauner at kernel.org; viro at zeniv.linux.org.uk; jack at suse.cz; jaegeuk at kernel.org;
>> >> > bcrl at kvack.org; dhowells at redhat.com; bvanassche at acm.org;
>> >> > asml.silence at gmail.com; linux-nvme at lists.infradead.org; linux-
>> >> > fsdevel at vger.kernel.org; io-uring at vger.kernel.org; linux-block at vger.kernel.org;
>> >> > linux-aio at kvack.org; gost.dev at samsung.com; vishak.g at samsung.com
>> >> > Subject: Re: [PATCH v7 0/3] FDP and per-io hints
>> >> >
>> >> > On Mon, Oct 7, 2024 at 12:10 PM Javier González <javier.gonz at samsung.com>
>> >> > wrote:
>> >> > >
>> >> > > On 04.10.2024 14:30, Christoph Hellwig wrote:
>> >> > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
>> >> > > >> So, considering that file systems _are_ able to use temperature hints and
>> >> > > >> actually make them work, why don't we support FDP the same way we are
>> >> > > >> supporting zones so that people can use it in production?
>> >> > > >
>> >> > > >Because apparently no one has tried it.  It should be possible in theory,
>> >> > > >but for example unless you have a power-of-two reclaim unit size it
>> >> > > >won't work very well with XFS where the AGs/RTGs must be power of two
>> >> > > >aligned in the LBA space, except by overprovisioning the LBA space vs
>> >> > > >the capacity actually used.
>> >> > >
>> >> > > This is good. I think we should have at least an FS POC with data
>> >> > > placement support to be able to draw conclusions on how the interface
>> >> > > and requirements should look. Until we have that, we can support the
>> >> > > use-cases that we know customers are asking for, i.e., block-level hints
>> >> > > through the existing temperature API.
>> >> > >
>> >> > > >
>> >> > > >> I agree that down the road, an interface that allows hints (many more
>> >> > > >> than 5!) is needed. And in my opinion, this interface should not have
>> >> > > >> semantics attached to it, just a hint ID, #hints, and enough space to
>> >> > > >> put 100s of them to support storage node deployments. But this needs to
>> >> > > >> come from the users of the hints / zones / streams / etc., not from
>> >> > > >> us vendors. We have neither details on how they deploy these
>> >> > > >> features at scale nor the workloads to validate the results. Anything
>> >> > > >> else will probably just continue polluting the storage stack with more
>> >> > > >> interfaces that are not used and add to the problem of data placement
>> >> > > >> fragmentation.
>> >> > > >
>> >> > > >Please always mention what layer you are talking about.  At the syscall
>> >> > > >level the temperature hints are doing quite ok.  A full stream separation
>> >> > > >would obviously be a lot better, as would be communicating the zone /
>> >> > > >reclaim unit / etc size.
>> >> > >
>> >> > > I mean at the syscall level. But as mentioned above, we need to be very
>> >> > > sure that we have a clear use-case for that. If we continue seeing hints
>> >> > > being used in NVMe or other protocols, and the number increases
>> >> > > significantly, we can deal with it later on.
>> >> > >
>> >> > > >
>> >> > > >As an interface to a driver that doesn't natively speak temperature
>> >> > > >hints, on the other hand, it doesn't work at all.
>> >> > > >
>> >> > > >> The issue is that the first series of this patch, which is as simple as
>> >> > > >> it gets, hit the list in May. Since then we have gone down paths that lead
>> >> > > >> nowhere. So the line between real technical feedback that leads to
>> >> > > >> a feature being merged and misleading technical feedback that just keeps
>> >> > > >> people busy becomes very thin. In the whole data placement effort, we have
>> >> > > >> been down this path many times, unfortunately...
>> >> > > >
>> >> > > >Well, the previous round was the first one actually trying to address the
>> >> > > >fundamental issue after 4 months.  And then after a first round of feedback
>> >> > > >it gets shut down somehow out of nowhere.  As a maintainer and reviewer, that
>> >> > > >is the kind of contributor I have a hard time taking seriously.
>> >> > >
>> >> > > I am not sure I understand what you mean in the last sentence, so I will
>> >> > > not respond by filling in the blanks with a bad interpretation.
>> >> > >
>> >> > > In summary, what we are asking for is to take the patches that cover the
>> >> > > current use-case, and work together on what might be needed for better
>> >> > > FS support. For this, it seems you and Hans have a good idea of what you
>> >> > > want to have based on XFS. We can help review or do part of the work,
>> >> > > but trying to guess our way will only delay existing customers using
>> >> > > existing HW.
>> >> >
>> >> > After reading the whole thread, I end up wondering why we need to rush the
>> >> > support for a single use case through instead of putting the architecture
>> >> > in place for properly supporting this new type of hardware from the start
>> >> > throughout the stack.
>> >>
>> >> This is not a rush. We have been supporting this use case through passthru for
>> >> over half a year with code already upstream in CacheLib. This is mature enough
>> >> to move into the block layer, which is what the end user wants to do either way.
>> >>
>> >> This is, though, a very good point. This is why we upstreamed passthru at the
>> >> time: so that people could experiment, validate, and upstream only when there is
>> >> a clear path.
>> >>
>> >> >
>> >> > Even for user space consumers of raw block devices, is the last version
>> >> > of the patch set good enough?
>> >> >
>> >> > * It severely cripples the data separation capabilities as only a handful of
>> >> >   data placement buckets are supported
>> >>
>> >> I could understand from your presentation at LPC, and later from looking at the
>> >> code that is available, that you have been successful at getting good results
>> >> with the existing interface in XFS. The mapping from the temperature semantics
>> >> to zones (no semantics) is exactly the same as what we are doing with FDP. Not
>> >> having to change either in-kernel or user-space structures is great.
>> >
>> >No, we don't map data directly to zones using lifetime hints. In fact,
>> >lifetime hints contribute only a relatively small part (~10% extra write
>> >amp reduction, see the rocksdb benchmark results). Segregating data by
>> >file is the most important part of the data placement heuristic, at
>> >least for this type of workload.
>>
>> Is this because RocksDB already does segregation per file itself? Are
>> you doing something specific on XFS or using your knowledge of RocksDB to
>> map files with an "unwritten" protocol in the middle?
>
>Data placement by file is based on the observation that the lifetimes of a
>file's data blocks are strongly correlated. When a file is deleted, all its blocks
>will be reclaimable at that point. This requires knowledge about the
>data placement buckets and works really well without any hints
>provided.

But we need hints to put files together. I believe you do this already,
as no placement protocol gives you unlimited separation.

>The life-time hint heuristic I added on top is based on rocksdb
>statistics, but designed to be generic enough to work for a wider
>range of workloads (still need to validate this though - more work to
>be done).

Maybe you can post the patches for the parts dedicated to the VFS level
and the user-space API (syscall or io_uring)?

Following up on the comment to Christoph, it would be good to have
something tangible to work together on for the next stage of this
support.
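
To make the current baseline concrete, this is roughly what the existing
syscall-level hint interface looks like from user space (a minimal sketch
only; the file path and hint value are just for illustration, and the
fallback defines are only there for older libc headers):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Fallback defines for older libc headers; values match linux/fcntl.h. */
#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT           (1024 + 12)     /* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT    2
#endif

int main(void)
{
        uint64_t hint = RWH_WRITE_LIFE_SHORT;   /* "hot": expected to be rewritten soon */
        int fd = open("/mnt/data/wal.log", O_WRONLY | O_CREAT, 0644);

        if (fd < 0)
                return 1;

        /*
         * The hint is attached to the inode and applies to subsequent
         * writes; the driver side (streams, or FDP with this series) can
         * map it to a placement bucket.
         */
        if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
                perror("F_SET_RW_HINT");

        /* ... normal write path follows ... */
        close(fd);
        return 0;
}

The per-io variant discussed in this series would complement this
per-file interface, but the mapping question on the driver side is the
same.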

>
>>
>>     In this context, we have collected data both using FDP natively in
>>     RocksDB and using the temperatures. Both look very good, because both
>>     are initiated by RocksDB, and the FS just passes the hints directly
>>     to the driver.
>>
>> I ask this to understand whether this is the FS's responsibility or the
>> application's. Our work points more to letting applications use the
>> hints (as the use-cases are power users, like RocksDB). I agree with you
>> that an FS could potentially make an improvement for legacy applications
>> - we have not focused much on these though, so I trust your insights on
>> it.
>
>The big problem as I see it is that if applications are going to work
>well together on the same media we need a common placement
>implementation somewhere, and it seems pretty natural to make it part
>of filesystems to me.

For FS users, this makes a lot of sense. But we still need to cover
applications using raw block devices.
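
For reference, this is roughly what that raw-block path looks like today
through NVMe passthru (a rough sketch only; the device path, nsid, LBA
size and placement handle index are assumptions, the namespace must have
FDP enabled, and passthru typically needs CAP_SYS_ADMIN):

#include <fcntl.h>
#include <linux/nvme_ioctl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define LBA_SHIFT       12              /* assumption: 4KiB LBA format */

/* Write (nlb + 1) LBAs at slba through FDP placement handle 'phndl'. */
static int fdp_write(int fd, uint32_t nsid, void *buf, uint64_t slba,
                     uint16_t nlb, uint16_t phndl)
{
        struct nvme_passthru_cmd64 cmd = {
                .opcode   = 0x01,                       /* NVMe Write */
                .nsid     = nsid,
                .addr     = (uintptr_t)buf,
                .data_len = (uint32_t)(nlb + 1) << LBA_SHIFT,
                .cdw10    = (uint32_t)slba,
                .cdw11    = (uint32_t)(slba >> 32),
                .cdw12    = nlb | (2u << 20),           /* NLB (0-based), DTYPE=2 (placement) */
                .cdw13    = (uint32_t)phndl << 16,      /* DSPEC = placement handle index */
        };

        return ioctl(fd, NVME_IOCTL_IO64_CMD, &cmd);
}

int main(void)
{
        void *buf;
        /* hypothetical device node, just for illustration */
        int fd = open("/dev/ng0n1", O_RDWR);

        if (fd < 0 || posix_memalign(&buf, 4096, 4096))
                return 1;
        memset(buf, 0xab, 4096);

        /* one 4KiB LBA at LBA 0, steered to placement handle 1 */
        if (fdp_write(fd, 1, buf, 0, 0, 1))
                return 1;

        free(buf);
        close(fd);
        return 0;
}

Moving the hint into the block layer is what would let these applications
drop the passthru dependency.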



