[PATCHv10 9/9] scsi: set permanent stream count in block limits
Christoph Hellwig
hch at lst.de
Tue Oct 29 21:55:26 PDT 2024
On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
> On Tue, Oct 29, 2024 at 04:53:30PM +0100, Christoph Hellwig wrote:
> > On Tue, Oct 29, 2024 at 09:38:44AM -0600, Keith Busch wrote:
> > > They're not exposed as write streams. Patch 7/9 sets the feature if it
> > > is a placement id or not, and only nvme sets it, so scsi's attributes
> > > are not claiming to be a write stream.
> >
> > So it shows up in sysfs, but:
> >
> > - queue_max_write_hints (which really should be queue_max_write_streams)
> > still picks it up, and from there the statx interface
> >
> > - per-inode fcntl hints that encode a temperature still magically
> > get dumped into the write streams if they are set.
> >
> > In other words it's a really leaky half-baked abstraction.
>
> Exactly why I asked last time: "who uses it and how do you want them to
> use it" :)
For the temperature hints the only public user I know of is rocksdb, and
that only started working when Hans fixed a brown paper bag bug in the
rocksdb code a while ago. Given that f2fs interprets the hints I suspect
something in the Android world does as well, maybe Bart knows more.

For the separate write streams the usage I want for them is poor man's
zones - e.g. write N LBAs sequentially into a separate write stream
and then eventually discard them together. This will fit nicely into
f2fs and the pending xfs work as well as quite a few userspace storage
systems. For that the file system or application needs to query
the number of available write streams (and in the bitmap world their
numbers if they are discontiguous) and the size you can fit into the
"reclaim unit" in FDP terms. I've not been bothering you much with
the latter as it is an easy retrofit once the I/O path bits land.
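
To make that usage pattern concrete, here is a toy in-memory model of it.
It is only a sketch and not any kernel or NVMe interface: the names
StreamAllocator, stream_ids and reclaim_unit_bytes are hypothetical and
stand in for whatever the query interface ends up reporting.

```python
class StreamAllocator:
    """Toy model of 'poor man's zones' on top of write streams.

    Hypothetical illustration only; nothing here talks to a device.
    """

    def __init__(self, stream_ids, reclaim_unit_bytes):
        # stream_ids may be discontiguous, e.g. when reported as a bitmap.
        self.reclaim_unit_bytes = reclaim_unit_bytes
        self.written = {sid: 0 for sid in stream_ids}

    def append(self, stream_id, nbytes):
        """Account for a sequential write; return its offset in the unit."""
        used = self.written[stream_id]
        if used + nbytes > self.reclaim_unit_bytes:
            raise ValueError("write would overflow the reclaim unit")
        self.written[stream_id] = used + nbytes
        return used

    def discard(self, stream_id):
        """Discard everything written to the stream in one go."""
        freed = self.written[stream_id]
        self.written[stream_id] = 0
        return freed
```

The caller needs exactly the two pieces of information mentioned above:
the set of usable stream numbers and how much fits into one reclaim unit.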
> > Let's brainstorm how it could be done better:
> >
> > - the max_write_streams value is only set by block devices that actually
> > do support write streams, and not the fire and forget temperature
> > hints. The way this is queried is by having a non-zero value
> > there, no need for an extra flag.
>
> So we need a completely different attribute for SCSI's permanent write
> streams? You'd mentioned earlier you were okay with having SCSI be able
> to utilize per-io raw block write hints. Having multiple things to
> check for what are all just write classifiers seems unnecessarily
> complicated.
I don't think the multiple write streams interface applies to SCSI's
write streams, as they enforce a relative temperature, and they don't
have the concept of how much you can write into a "reclaim unit".
OTOH there isn't much you need to query for them anyway, as the
temperature hints have always been defined as pure hints with all
the upsides and downsides of that.
> No need to create a new fcntl. The people already testing this are
> successfully using FDP with the existing fcntl hints. Their applications
> leverage FDP as a way to separate files based on expected lifetime. It is
> how they want to use it and it is working above expectations.
FYI, I think it's always fine and easy to map the temperature hints to
write streams if that's all the driver offers. It loses a lot of the
capabilities, but as long as it doesn't enforce a lower level interface
that never exposes more that's fine.
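
One way such a mapping could look, purely as an illustration: the
RWH_WRITE_LIFE_* values below match the Linux fcntl hint constants, but
hint_to_stream() and its clamping policy are assumptions made up for this
sketch, not what any driver or the patch series actually does.

```python
# Values of the per-inode write lifetime hints from the Linux UAPI.
RWH_WRITE_LIFE_NOT_SET = 0
RWH_WRITE_LIFE_NONE = 1
RWH_WRITE_LIFE_SHORT = 2
RWH_WRITE_LIFE_MEDIUM = 3
RWH_WRITE_LIFE_LONG = 4
RWH_WRITE_LIFE_EXTREME = 5


def hint_to_stream(hint, nr_streams):
    """Map a temperature hint to a 1-based stream number, 0 for none.

    Hypothetical policy: hints beyond the number of available streams
    share the last stream, so the mapping degrades gracefully.
    """
    if nr_streams == 0 or hint <= RWH_WRITE_LIFE_NONE:
        return 0
    return min(hint - RWH_WRITE_LIFE_NONE, nr_streams)
```

With two streams, SHORT gets stream 1 and everything warmer than that
shares stream 2, which is where the loss of capabilities shows up.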