[PATCH v7 0/3] FDP and per-io hints

Hans Holmberg hans at owltronix.com
Thu Oct 10 03:46:53 PDT 2024


On Thu, Oct 10, 2024 at 9:13 AM Javier Gonzalez <javier.gonz at samsung.com> wrote:
>
> On 10.10.2024 08:40, Hans Holmberg wrote:
> >On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <javier.gonz at samsung.com> wrote:
> >>
> >>
> >>
> >> > -----Original Message-----
> >> > From: Hans Holmberg <hans at owltronix.com>
> >> > Sent: Tuesday, October 8, 2024 12:07 PM
> >> > To: Javier Gonzalez <javier.gonz at samsung.com>
> >> > Cc: Christoph Hellwig <hch at lst.de>; Jens Axboe <axboe at kernel.dk>; Martin K.
> >> > Petersen <martin.petersen at oracle.com>; Keith Busch <kbusch at kernel.org>;
> >> > Kanchan Joshi <joshi.k at samsung.com>; hare at suse.de; sagi at grimberg.me;
> >> > brauner at kernel.org; viro at zeniv.linux.org.uk; jack at suse.cz; jaegeuk at kernel.org;
> >> > bcrl at kvack.org; dhowells at redhat.com; bvanassche at acm.org;
> >> > asml.silence at gmail.com; linux-nvme at lists.infradead.org; linux-
> >> > fsdevel at vger.kernel.org; io-uring at vger.kernel.org; linux-block at vger.kernel.org;
> >> > linux-aio at kvack.org; gost.dev at samsung.com; vishak.g at samsung.com
> >> > Subject: Re: [PATCH v7 0/3] FDP and per-io hints
> >> >
> >> > On Mon, Oct 7, 2024 at 12:10 PM Javier González <javier.gonz at samsung.com>
> >> > wrote:
> >> > >
> >> > > On 04.10.2024 14:30, Christoph Hellwig wrote:
> >> > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
> >> > > >> So, considering that file systems _are_ able to use temperature hints and
> >> > > >> actually make them work, why don't we support FDP the same way we are
> >> > > >> supporting zones so that people can use it in production?
> >> > > >
> >> > > >Because apparently no one has tried it.  It should be possible in theory,
> >> > > >but for example unless you have a power of 2 reclaim unit size it
> >> > > >won't work very well with XFS where the AGs/RTGs must be power of two
> >> > > >aligned in the LBA space, except by overprovisioning the LBA space vs
> >> > > >the capacity actually used.
> >> > >
> >> > > This is good. I think we should have at least a FS POC with data
> >> > > placement support to be able to drive conclusions on how the interface
> >> > > and requirements should be. Until we have that, we can support the
> >> > > use-cases that we know customers are asking for, i.e., block-level hints
> >> > > through the existing temperature API.
> >> > >
> >> > > >
> >> > > >> I agree that down the road, an interface that allows hints (many more
> >> > > >> than 5!) is needed. And in my opinion, this interface should not have
> >> > > >> semantics attached to it, just a hint ID, #hints, and enough space to
> >> > > >> put 100s of them to support storage node deployments. But this needs to
> >> > > >> come from the users of the hints / zones / streams / etc,  not from
> >> > > >> us vendors. We have neither details on how they deploy these
> >> > > >> features at scale, nor the workloads to validate the results. Anything
> >> > > >> else will probably just continue polluting the storage stack with more
> >> > > >> interfaces that are not used and add to the problem of data placement
> >> > > >> fragmentation.
> >> > > >
> >> > > >Please always mention what layer you are talking about.  At the syscall
> >> > > >level the temperature hints are doing quite ok.  A full stream separation
> >> > > >would obviously be a lot better, as would be communicating the zone /
> >> > > >reclaim unit / etc size.
> >> > >
> >> > > I mean at the syscall level. But as mentioned above, we need to be very
> >> > > sure that we have a clear use-case for that. If we continue seeing hints
> >> > > being used in NVMe or other protocols, and the number increases
> >> > > significantly, we can deal with it later on.
> >> > >
> >> > > >
> >> > > >As an interface to a driver that doesn't natively speak temperature
> >> > > >hint on the other hand it doesn't work at all.
> >> > > >
> >> > > >> The issue is that the first series of this patch, which is as simple as
> >> > > >> it gets, hit the list in May. Since then we are down paths that lead
> >> > > >> nowhere. So the line between real technical feedback that leads to
> >> > > >> a feature being merged, and technical misleading to make people be a
> >> > > >> busy bee becomes very thin. In the whole data placement effort, we have
> >> > > >> been down this path many times, unfortunately...
> >> > > >
> >> > > >Well, the previous round was the first one actually trying to address the
> >> > > >fundamental issue after 4 months.  And then after a first round of feedback
> >> > > >it gets shut down somehow out of nowhere.  As a maintainer and reviewer that
> >> > > >is the kind of contributor I have a hard time taking seriously.
> >> > >
> >> > > I am not sure I understand what you mean in the last sentence, so I will
> >> > > not respond filling blanks with a bad interpretation.
> >> > >
> >> > > In summary, what we are asking for is to take the patches that cover the
> >> > > current use-case, and work together on what might be needed for better
> >> > > FS support. For this, it seems you and Hans have a good idea of what you
> >> > > want to have based on XFS. We can help review or do part of the work,
> >> > > but trying to guess our way will only delay existing customers using
> >> > > existing HW.
> >> >
> >> > After reading the whole thread, I end up wondering why we need to rush the
> >> > support for a single use case through instead of putting the architecture
> >> > in place for properly supporting this new type of hardware from the start
> >> > throughout the stack.
> >>
> >> This is not a rush. We have been supporting this use case through passthru for
> >> over 1/2 year with code already upstream in Cachelib. This is mature enough as
> >> to move into the block layer, which is what the end user wants to do either way.
> >>
> >> This is though a very good point. This is why we upstreamed passthru at the
> >> time; so people can experiment, validate, and upstream only when there is a
> >> clear path.
> >>
> >> >
> >> > Even for user space consumers of raw block devices, is the last version
> >> > of the patch set good enough?
> >> >
> >> > * It severely cripples the data separation capabilities as only a handful of
> >> >   data placement buckets are supported
> >>
> >> I could understand from your presentation at LPC, and later looking at the code that
> >> is available that you have been successful at getting good results with the existing
> >> interface in XFS. The mapping from the temperature semantics to zones (no semantics)
> >> is the exact same as we are doing with FDP. Not having to change either in-kernel or user-space
> >> structures is great.
> >
> >No, we don't map data directly to zones using lifetime hints. In fact,
> >lifetime hints contribute only a relatively small part (~10% extra write
> >amp reduction, see the rocksdb benchmark results).
> >Segregating data by file is the most important part of the data placement
> >heuristic, at least for this type of workload.
>
> Is this because RocksDB already does segregation per file itself? Are
> you doing something specific on XFS or using your knowledge on RocksDB to
> map files with an "unwritten" protocol in the middle?

By-file data placement is based on the observation that the lifetimes of
a file's data blocks are strongly correlated. When a file is deleted, all
of its blocks become reclaimable at that point. This requires knowledge
about the data placement buckets and works really well without any hints
provided.
The lifetime hint heuristic I added on top is based on RocksDB
statistics, but designed to be generic enough to work for a wider
range of workloads (I still need to validate this though, more work to
be done).
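
As a rough illustration of the by-file argument above, here is a toy
model in Python. It is only a sketch and not the XFS implementation: the
unit size, the write pattern, and the function names are all made up for
the example. It counts how many still-valid blocks would have to be
relocated to reclaim every unit touched by a deleted file, first with
blocks placed in arrival order and then with blocks grouped by file.

```python
from collections import defaultdict

UNIT_BLOCKS = 4  # hypothetical zone / reclaim unit size, in blocks


def chunk(blocks):
    """Split a list of blocks into units of UNIT_BLOCKS blocks each."""
    return [blocks[i:i + UNIT_BLOCKS] for i in range(0, len(blocks), UNIT_BLOCKS)]


def place(writes, by_file):
    """Return a list of units; each unit lists the owning file of every block."""
    if not by_file:
        # No separation: blocks fill units in arrival order.
        return chunk(writes)
    # By-file separation: each file's blocks go to that file's own units.
    per_file = defaultdict(list)
    for owner in writes:
        per_file[owner].append(owner)
    units = []
    for blocks in per_file.values():
        units.extend(chunk(blocks))
    return units


def blocks_to_relocate(units, deleted):
    """Valid blocks that must be copied out to reclaim the units
    that contain blocks of the deleted file."""
    return sum(
        sum(1 for owner in unit if owner != deleted)
        for unit in units
        if deleted in unit
    )


# Two files written concurrently, one block at a time (assumed pattern).
writes = ["a", "b", "a", "b", "a", "b", "a", "b"]
mixed = place(writes, by_file=False)
separated = place(writes, by_file=True)

print("relocations, mixed:    ", blocks_to_relocate(mixed, "a"))
print("relocations, separated:", blocks_to_relocate(separated, "a"))
```

In this toy case, deleting file "a" leaves the separated layout with
whole units that can be reclaimed without copying anything, while the
mixed layout has to move every block of "b" first.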

>
>     In this context, we have collected data both using FDP natively in
>     RocksDB and using the temperatures. Both look very good, because both
>     are initiated by RocksDB, and the FS just passes the hints directly
>     to the driver.
>
> I ask this to understand if this is the FS responsibility or the
> application's one. Our work points more to letting applications use the
> hints (as the use-cases are power users, like RocksDB). I agree with you
> that a FS could potentially make an improvement for legacy applications
> - we have not focused much on these though, so I trust your insights on
> it.

The big problem, as I see it, is that if applications are going to work
well together on the same media, we need a common placement
implementation somewhere, and to me it seems pretty natural to make it
part of the filesystem.
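
To make the multi-application concern concrete, here is a small Python
sketch. Everything in it is hypothetical: the bucket count, the
application names, and the SharedPlacement class are invented for
illustration and do not correspond to any kernel or NVMe interface. It
contrasts passing a hint straight through (the application identity is
lost, so equal hints from different applications share a bucket) with a
toy common layer that assigns buckets per (application, hint) stream.

```python
NUM_BUCKETS = 4  # hypothetical number of placement buckets on the device


def passthrough_bucket(app, hint):
    """Hint is forwarded as-is: the bucket depends on the hint only."""
    return hint % NUM_BUCKETS


class SharedPlacement:
    """Toy common placement layer: one bucket per (app, hint) stream."""

    def __init__(self, num_buckets):
        self.num_buckets = num_buckets
        self.assigned = {}

    def bucket(self, app, hint):
        key = (app, hint)
        if key not in self.assigned:
            # Hand out the next bucket; wrap around (and share) once
            # there are more streams than buckets.
            self.assigned[key] = len(self.assigned) % self.num_buckets
        return self.assigned[key]


# Two applications that both tag their data with the same hint value.
streams = [("app_a", 2), ("app_b", 2)]

direct = {s: passthrough_bucket(*s) for s in streams}
layer = SharedPlacement(NUM_BUCKETS)
shared = {s: layer.bucket(*s) for s in streams}

print("passthrough:", direct)
print("shared layer:", shared)
```

With passthrough both streams land in the same bucket and their data is
mixed; the shared layer keeps them apart as long as buckets are
available. Where such a layer should live is exactly the open question
in this thread.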


>
> >>
> >> >
> >> > * It just won't work if there is more than one user application per storage
> >> >   device as different applications data streams will be mixed at the nvme
> >> >   driver level..
> >>
> >> For now this use-case is not clear. Folks working on it are using passthru. When we
> >> have a more clear understanding of what is needed, we might need changes in the kernel.
> >>
> >> If you see a need for this on the work that you are doing, by all means, please send patches.
> >> As I said at LPC, we can work together on this.
> >>
> >> >
> >> > While Christoph has already outlined what would be desirable from a
> >> > file system point of view, I don't have the answer to what would be the overall
> >> > best design for FDP. I would like to say that it looks to me like we need to
> >> > consider more than the early adoption use cases and make sure we
> >> > make the most of the hardware capabilities via logical abstractions that
> >> > would be compatible with a wider range of storage devices.
> >> >
> >> > Figuring the right way forward is tricky, but why not just let it take the time
> >> > that is needed to sort this out while early users explore how to use FDP
> >> > drives and share the results?
> >>
> >> I agree that we might need a new interface to support more hints, beyond the temperatures.
> >> Or maybe not. We would not know until someone comes with a use case. We have made the
> >> mistake in the past of treating internal research as upstreamable work. I now can see that
> >> this simply complicates the in-kernel and user-space APIs.
> >>
> >> The existing API is usable and requires no changes. There is hardware. There are customers.
> >> There are applications with upstream support which have been tested with passthru (the
> >> early results you mention). And the wiring to NVMe is _very_ simple. There is no reason
> >> not to take this in, and then we will see what new interfaces we might need in the future.
> >>
> >> I would much rather spend time in discussing ideas with you and others on a potential
> >> future API than arguing about the validity of an _existing_ one.
> >>
> >
> >Yes, but while FDP support could be improved later on (happy to help if
> >that'll be the case), I'm just afraid that less work now defining the
> >way data placement is exposed is going to result in a bigger mess later
> >when more use cases will be considered.
>
> Please, see the message I responded on the other thread. I hope it is a
> way to move forward and actually work together on this.


