[EXT] Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams

Christoph Hellwig hch at lst.de
Wed Nov 20 23:17:48 PST 2024


On Wed, Nov 20, 2024 at 11:11:12AM -0700, Keith Busch wrote:
> Various applications were written to that interface and showed initial
> promise, but production quality hardware never materialized.

FYI, production-grade NVMe streams hardware did materialize and is still
shipped and used by multiple storage OEMs.  Like most things in
enterprise storage, you're unlikely to see it outside the firmware builds
for those few customers that actually required and QAed it.

> Some of
> these applications are still setting the write hints today, and the
> filesystems are all passing through the block stack, but there's just
> currently no nvme driver listening on the other side.

The only source-available application we could find that is using these
hints is rocksdb, which got the fcntl interface wrong so that it did not
have a chance to actually work until Hans fixed it recently.  Once he
fixed it, it shows great results when using file system based hinting,
although it still needs tuning to align its internal file size to
the hardware reclaim unit size, i.e. it either needs behind-the-scenes
knowledge or an improved interface to be properly optimized.
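
For reference, a minimal sketch of the hinting interface in question, as
documented in fcntl(2): F_SET_RW_HINT takes a pointer to a 64-bit value
and applies the hint to the underlying inode.  The helper names and the
round-up helper are illustrative assumptions, not code from rocksdb or
from this series, and the reclaim unit size has to come from somewhere
else because this interface does not expose it.

```c
/* Sketch only: helper names are hypothetical, not taken from rocksdb. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>

/* Fallbacks for older libc headers; values match the Linux UAPI. */
#ifndef F_LINUX_SPECIFIC_BASE
#define F_LINUX_SPECIFIC_BASE	1024
#endif
#ifndef F_GET_RW_HINT
#define F_GET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 11)
#endif
#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_NOT_SET	0
#define RWH_WRITE_LIFE_NONE	1
#define RWH_WRITE_LIFE_SHORT	2
#define RWH_WRITE_LIFE_MEDIUM	3
#define RWH_WRITE_LIFE_LONG	4
#define RWH_WRITE_LIFE_EXTREME	5
#endif

/*
 * Set the write lifetime hint on the inode behind fd.  The fcntl
 * argument is a pointer to a uint64_t, not the hint value itself.
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_write_hint(int fd, uint64_t hint)
{
	return fcntl(fd, F_SET_RW_HINT, &hint);
}

/* Read the current hint back into *hint.  Returns 0 or -1. */
int get_write_hint(int fd, uint64_t *hint)
{
	return fcntl(fd, F_GET_RW_HINT, hint);
}

/*
 * Round a target file size up to a whole number of reclaim units.
 * ru_bytes must be non-zero and is assumed to be known out of band.
 */
uint64_t round_up_to_reclaim_unit(uint64_t file_bytes, uint64_t ru_bytes)
{
	return ((file_bytes + ru_bytes - 1) / ru_bytes) * ru_bytes;
}
```

An application would call set_write_hint(fd, RWH_WRITE_LIFE_SHORT) on a
regular file before writing to it, and could size its files with
round_up_to_reclaim_unit() if it somehow knows the reclaim unit size.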

> The meaning assigned to an FDP stream is whatever the user wants it to
> mean. It's not strictly a lifetime hint, but that is certainly a valid
> way to use them. The contract on the device's side is that writes to
> one stream won't create media interfere or contention with writes to
> other streams. This is the same as nvme's original streams, which for
> some reason did not carry any of this controversy.

Maybe people realized how badly that works outside a few very special
purpose uses?

I've said it before, but if you really do want to bypass the file
system (and there are very good reasons for that for some workloads),
bypass it entirely.  Don't try to micro-manage the layer below the
file system from the application without any chance for the file system
to even be in the know.

The entire discussion also seems to treat file systems as simple
containers for blocks that are static.  While that is roughly true
for a lot of older file system designs, once you implement things
like snapshots, data checksums, data journalling or, in general,
flash-friendly metadata write patterns, that is very wrong, and
the file systems will want to be able to separate write streams
independently of the pure application write streams.


