[LSF/MM/BPF TOPIC] Large block for I/O

Viacheslav Dubeyko slava at dubeyko.com
Mon Dec 25 00:55:23 PST 2023



> On Dec 22, 2023, at 7:06 PM, Matthew Wilcox <willy at infradead.org> wrote:
> 
> On Fri, Dec 22, 2023 at 08:10:54AM -0700, Keith Busch wrote:
>> If the host really wants to write in small granularities, then larger
>> block sizes just shifts the write amplification from the device to the
>> host, which seems worse than letting the device deal with it.
> 
> Maybe?  I'm never sure about that.  See, if the drive is actually
> managing the flash in 16kB chunks internally, then the drive has to do a
> RMW which is increased latency over the host just doing a 16kB write,
> which can go straight to flash.  Assuming the host has the whole 16kB in
> memory (likely?)  Of course, if you're PCIe bandwidth limited, then a
> 4kB write looks more attractive, but generally I think drives tend to
> be IOPS limited not bandwidth limited today?
> 

Fundamentally, if a storage device supports a 16K physical sector size, then
I am not sure that we can write with 4K I/O requests at all. It means that we
should read the whole 16K LBA into the page cache or the application’s buffer
before any write operation. So, I see a potential RMW inside the storage device
only if the device is capable of handling 4K I/O requests even though the
physical sector is 16K. But is that a real-life use case?
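
To make the RMW case concrete, here is a minimal sketch in Python. It is a toy
model only, not how any real device firmware or the kernel works: the media is
a plain bytearray, and the names SECTOR, BLOCK and rmw_write are hypothetical.
It assumes a 16K physical sector and 4K-aligned host writes, and counts how
many bytes have to be read and rewritten for one request.

```python
SECTOR = 16 * 1024  # assumed physical sector size (16K)
BLOCK = 4 * 1024    # assumed host write granularity (4K)


def rmw_write(media: bytearray, offset: int, data: bytes) -> tuple[int, int]:
    """Apply a 4K-aligned write to media that can only be updated in whole
    16K sectors. Returns (bytes_read, bytes_written) at the media level."""
    if not data or offset % BLOCK or len(data) % BLOCK:
        raise ValueError("write must be non-empty and 4K-aligned")
    if len(media) % SECTOR or offset + len(data) > len(media):
        raise ValueError("write does not fit the media")

    # Round the request out to whole physical sectors.
    start = offset - offset % SECTOR
    end = -(-(offset + len(data)) // SECTOR) * SECTOR

    buf = bytearray(end - start)
    bytes_read = 0
    for sector in range(start, end, SECTOR):
        fully_covered = (offset <= sector
                         and sector + SECTOR <= offset + len(data))
        if not fully_covered:
            # Read: a partially covered sector must be fetched first.
            buf[sector - start:sector - start + SECTOR] = \
                media[sector:sector + SECTOR]
            bytes_read += SECTOR

    # Modify: overlay the new data on the staged sectors.
    buf[offset - start:offset - start + len(data)] = data
    # Write: put whole sectors back.
    media[start:end] = buf
    return bytes_read, end - start
```

In this model a single 4K write costs a 16K read plus a 16K write, while an
aligned 16K write costs no read at all. Whether that work happens in the device
or in the host page cache is exactly the question above.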

I am not sure about the attractiveness of 4K write operations. Usually, a file
system provides a way to configure its internal logical block size and metadata
granularities. So, it is possible to align the internal metadata and user data
granularities to 16K, for example. And if we are talking about metadata
structures (for example, the inode table, block mapping, etc.), then this is
frequently updated data. So, a 16K block will most probably contain several
updated 4K pieces. As a result, we have to flush all of this updated metadata
anyway, despite any PCIe bandwidth limitation (if there is one). Also, I assume
that sending one 16K I/O request could be more beneficial than sending several
4K I/O requests. Of course, real life is more complicated.
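
The trade-off between request count and transferred bytes can be sketched with
a little arithmetic. The Python below is only an illustration under simple
assumptions: count_writes is a hypothetical helper, dirty data is given as 4K
block indices, and each dirty 16K unit is flushed as one request with no
merging across units.

```python
def count_writes(dirty_4k_blocks, blocks_per_unit=4):
    """Compare flushing dirty data as individual 4K requests versus as
    whole 16K units (4 x 4K by default)."""
    dirty = set(dirty_4k_blocks)
    # A 16K unit is dirty if any of its 4K blocks is dirty.
    units = {block // blocks_per_unit for block in dirty}
    return {
        "requests_4k": len(dirty),
        "bytes_4k": len(dirty) * 4096,
        "requests_16k": len(units),
        "bytes_16k": len(units) * 4096 * blocks_per_unit,
    }


# Three updated 4K pieces in the first 16K unit, one in the second.
print(count_writes([0, 1, 2, 5]))
```

With clustered updates like these, the 16K case needs 2 requests instead of 4
and moves 32K instead of 16K. With a single isolated 4K update it still needs
1 request but moves 16K instead of 4K, which is the case where PCIe bandwidth
would matter.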

Thanks,
Slava.




More information about the Linux-nvme mailing list