[LSF/MM/BPF TOPIC] Large block for I/O

Luis Chamberlain mcgrof at kernel.org
Thu Feb 22 10:45:25 PST 2024


On Mon, Jan 08, 2024 at 07:35:17PM +0000, Matthew Wilcox wrote:
> On Mon, Jan 08, 2024 at 11:30:10AM -0800, Bart Van Assche wrote:
> > On 12/21/23 21:37, Christoph Hellwig wrote:
> > > On Fri, Dec 22, 2023 at 05:13:43AM +0000, Matthew Wilcox wrote:
> > > > It clearly solves a problem (and the one I think it's solving is the
> > > > size of the FTL map).  But I can't see why we should stop working on it,
> > > > just because not all drive manufacturers want to support it.
> > > 
> > > I don't think it is drive vendors.  It is is the SSD divisions which
> > > all pretty much love it (for certain use cases) vs the UFS/eMMC
> > > divisions which tends to often be fearful and less knowledgeable (to
> > > say it nicely) no matter what vendor you're talking to.
> > 
> > Hi Christoph,
> > 
> > If there is a significant number of 4 KiB writes in a workload (e.g.
> > filesystem metadata writes), and the logical block size is increased from
> > 4 KiB to 16 KiB, this will increase write amplification no matter how the
> > SSD storage controller has been designed, isn't it? Is there perhaps
> > something that I'm misunderstanding?
> 
> You're misunderstanding that it's the _drive_ which gets to decide the
> logical block size. Filesystems literally can't do 4kB writes to these
> drives; you can't do a write smaller than a block.  If your clients
> don't think it's a good tradeoff for them, they won't tell Linux that
> the minimum IO size is 16kB.

Yes, but it's perhaps good to review how flexible this might be or not.
I can at least mention what I know of for NVMe. Getting a lay of the
land for other storage media would be good too.

Some of the large capacity NVMe drives report an NPWG (Namespace
Preferred Write Granularity) of 16k. That just means the Indirection
Unit, the granularity of the drive's mapping table, is 16k, so the drive
is hinting *we prefer 16k*. You can still do 4k writes; it just means
that on these drives a 4k write will be a read-modify-write (RMW).
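If you want to see how those hints surface on a given system, the block
layer exports the derived limits through sysfs. A minimal userspace
sketch (the device name is just an example, and the values of course
depend entirely on what your drive reports):

/*
 * Print the queue limits the kernel derived from the drive's hints.
 * Build: cc -o qlimits qlimits.c ; run: ./qlimits nvme0n1
 */
#include <stdio.h>

static void show(const char *dev, const char *attr)
{
	char path[256], buf[64];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
	f = fopen(path, "r");
	if (!f)
		return;
	if (fgets(buf, sizeof(buf), f))
		printf("%-20s %s", attr, buf);
	fclose(f);
}

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "nvme0n1";

	show(dev, "logical_block_size");	/* the LBA format in use */
	show(dev, "physical_block_size");	/* capped by the atomic write size */
	show(dev, "minimum_io_size");		/* NPWG-derived hint (io_min) */
	show(dev, "optimal_io_size");		/* NOWS-derived hint (io_opt) */
	return 0;
}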

Users who *want* to help avoid RMWs on these drives, and want writes to
be at least 16k, can enable a 16k or larger block size so the writes are
aligned. The experimentation we have done using Daniel Gomez's eBPF
blkalgn tool [0] revealed (as discussed at last year's Plumbers) that
there were still some 4k writes; these turned out to come from XFS's
buffer cache usage for metadata. Dave recently posted patches to allow
large folios on the XFS buffer cache [1], and Daniel has started making
further observations on this which he'll be sharing soon.

[0] https://github.com/dagmcr/bcc/tree/blkalgn-dump
[1] https://lore.kernel.org/all/20240118222216.4131379-1-david@fromorbit.com/

For large capacity NVMe drives with large atomic writes (NAWUPF), the
nvme block driver will allow the physical block size to be 16k too,
which in turn allows the sector size to be set to 16k when creating the
filesystem; that *optionally* lets users force the filesystem to never
issue *any* 4k writes to the device. Note then that there are two ways
to use a 16k sector size on NVMe today: one is if your drive supports a
16k LBA format, and the other is having these two parameters set to 16k.
The latter lets you stick with a 512-byte or 4k LBA format and still use
a 16k sector size. That allows you to remain backward compatible.
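To make the NPWG/NAWUPF interplay concrete, here is a simplified sketch
of how the nvme driver fills in the queue limits, loosely modeled on
nvme_update_disk_info() in drivers/nvme/host/core.c; quirks, NABO
handling and error paths are omitted, and field names may drift between
kernel releases:

static void sketch_update_disk_info(struct gendisk *disk,
				    struct nvme_ns *ns,
				    struct nvme_id_ns *id)
{
	u32 bs = 1U << ns->lba_shift;	/* logical block size from the LBA format */
	u32 atomic_bs = bs, phys_bs = bs, io_opt = bs;

	/* NAWUPF: atomic write unit (power fail), in logical blocks, 0's based */
	if ((id->nsfeat & NVME_NS_FEAT_ATOMICS) && id->nawupf)
		atomic_bs = (1 + le16_to_cpu(id->nawupf)) * bs;

	if (id->nsfeat & NVME_NS_FEAT_IO_OPT) {
		/* NPWG: preferred write granularity, i.e. the indirection unit */
		phys_bs = bs * (1 + le16_to_cpu(id->npwg));
		/* NOWS: optimal write size */
		io_opt = bs * (1 + le16_to_cpu(id->nows));
	}

	blk_queue_logical_block_size(disk->queue, bs);
	/*
	 * The physical block size is capped by the atomic size: with both
	 * NPWG and NAWUPF at 16k on a 512-byte or 4k LBA format, the
	 * reported physical block size becomes 16k, and mkfs can then pick
	 * a 16k sector size without needing a 16k LBA format.
	 */
	blk_queue_physical_block_size(disk->queue, min(phys_bs, atomic_bs));
	blk_queue_io_min(disk->queue, phys_bs);
	blk_queue_io_opt(disk->queue, io_opt);
}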

Jan Kara's patches "block: Add config option to not allow writing to
mounted devices" [2] should allow us to remove the set_blocksize() call
in xfs_setsize_buftarg(), since XFS does not use the block device cache
at all, and his patches ensure that once a filesystem is mounted
userspace won't muck with the block device directly.

As for the impact of this on 4k writes: if you create the filesystem
with a 16k sector size then we're strict, and at minimum 16k writes are
needed. It is no different from what is done for 4k today, where the
logical block size is 512 bytes and we use a 4k sector size because the
physical block size is 4k. If using buffered IO then we can leverage the
page cache for modifications. Either way, you should do your WAF
homework too. Even if you *do* have 4k workloads, under the hood you may
find that the number of IOs which are actually 4k is very likely small.
Insofar as WAF is concerned, it is the *IO volume* that matters. Luca
Bert has a great write-up on his team's findings when evaluating WAF
estimates for some real world workloads in terms of IO volume [3].

[2] https://lkml.kernel.org/r/20231101173542.23597-1-jack@suse.cz
[3] https://www.micron.com/about/blog/2023/october/real-life-workloads-allow-more-efficient-data-granularity-and-enable-very-large-ssd-capacities
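To make the IO volume point concrete, here is a back-of-the-envelope
calculation; the IO mix below is entirely made up (not taken from [3]),
and it assumes the worst case where every 4k host write costs a full
16k RMW on media:

/*
 * Back-of-the-envelope illustration of why IO volume, not IO count,
 * drives the RMW penalty of a 16k indirection unit.
 */
#include <stdio.h>

int main(void)
{
	const double iu = 16.0 * 1024;			/* 16k indirection unit */
	/* hypothetical IO mix: some 4k writes, most bytes in 128k writes */
	const double small_ios = 2e5, small_sz = 4096;
	const double large_ios = 1e6, large_sz = 128 * 1024;

	double host_bytes  = small_ios * small_sz + large_ios * large_sz;
	/* worst case: every 4k host write turns into one 16k media write */
	double media_bytes = small_ios * iu + large_ios * large_sz;

	printf("4k writes by count:    %.1f%%\n",
	       100.0 * small_ios / (small_ios + large_ios));
	printf("4k writes by volume:   %.2f%%\n",
	       100.0 * small_ios * small_sz / host_bytes);
	printf("naive WAF upper bound: %.3f\n", media_bytes / host_bytes);
	return 0;
}

With that made-up mix the 4k writes are roughly 17% of the IOs by count
but well under 1% of the bytes written, so even the worst-case penalty
from the 16k IU keeps the estimated WAF below 1.02.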

We were not aware of public open source tools to do what they did,
so we worked on a tool that allows just that. You can measure your
workload's WAF using Daniel Gomez's WAF tool for NVMe [4] and decide if
the tradeoffs are acceptable. It would be good for us to automate
generic workloads, slap them into kdevops, and compute WAF, for
instance.

[4] https://github.com/dagmcr/bcc/tree/nvmeiuwaf

> Some workloads are better with a 4kB block size, no doubt.  Others are
> better with a 512 byte block size.  That doesn't prevent vendors from
> offering 4kB LBA size drives.

Indeed, using large block sizes is by no means meant for all workloads.
But it's a good time to also remind folks that larger IOs tend to be
good for flash storage in general too. So if your WAF measurements check
out, using large block sizes is something to evaluate.

 Luis


