[LSF/MM/BPF TOPIC] Large block for I/O

Viacheslav Dubeyko slava at dubeyko.com
Mon Dec 25 00:12:12 PST 2023



> On Dec 22, 2023, at 6:10 PM, Keith Busch <kbusch at kernel.org> wrote:
> 
> 

<skipped>

> 
> Other applications, though, still need 4k writes. Turning those to RMW
> on the host to modify 4k in the middle of a 16k block is obviously a bad
> fit.

So, if the application doesn’t work with the raw device directly and doesn’t use
O_DIRECT, then we always have the file system’s page cache in the middle. A 4K write
then dirties the whole 16K logical block from the file system’s point of view, and
the file system will eventually have to flush the whole 16K logical block even if
only 4K in the middle of it was modified. At first glance this looks like increased
write amplification. However, it is usually metadata that requires the smaller
granularity (like 4K), and metadata is also the most frequently updated type of data.
So there is a significant probability that, on average, a 16K logical block holding
metadata will accumulate several 4K updates before it is flushed. For cold user data
the logical block size doesn’t matter, because the write operations can be aligned.
I assume that frequently updated user data tends to be localized in a few areas of a
file, which means a 16K logical block could gather several frequently updated 4K
ranges. Theoretically, one can imagine a really nasty, evenly spread distribution of
4K updates across the whole file with holes in between, but that looks like stress
testing or benchmarking rather than a real-life use case or workload.
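
To make this concrete, here is a toy user-space model (mine, not from this thread;
all sizes and counts are made up) that tracks dirty state at 16K granularity and
reports how many bytes reach the device per byte written for random 4K updates:

#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE	(16 * 1024)	/* logical block size */
#define UPDATE_SIZE	(4 * 1024)	/* application write size */
#define NR_BLOCKS	1024		/* simulated file: 16 MiB */
#define UPDATES_TOTAL	100000
#define FLUSH_EVERY	256		/* 4K updates between flushes */

int main(void)
{
	static unsigned char dirty[NR_BLOCKS];
	unsigned long written = 0, flushed = 0;

	srand(1);
	for (int i = 1; i <= UPDATES_TOTAL; i++) {
		/* a 4K update dirties the whole 16K block it lands in */
		dirty[rand() % NR_BLOCKS] = 1;
		written += UPDATE_SIZE;

		if (i % FLUSH_EVERY == 0) {
			for (int b = 0; b < NR_BLOCKS; b++) {
				if (dirty[b]) {
					flushed += BLOCK_SIZE;
					dirty[b] = 0;
				}
			}
		}
	}
	printf("write amplification: %.2f\n", (double)flushed / written);
	return 0;
}

With uniformly spread updates this prints an amplification close to the worst-case
4x; shrinking NR_BLOCKS (i.e. localizing the updates, like hot metadata) drives it
toward 1x, which is exactly the point above.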

Now let’s imagine that an application writes directly to the raw device with 4K I/O
operations. If the block device supports a 16K sector size, can we still write with
4K I/O operations at all? From another point of view, if I know that my application
updates in 4K units, then what’s the point of using a device with a 16K physical
sector size in the first place?
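
As far as I understand, the answer to the first question depends on whether 16K is
the logical block size or only the physical sector size, because O_DIRECT I/O has to
be aligned to the logical block size. A minimal probe of that behavior (my sketch,
not from this thread; the device path is just an example):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	/* example device path; pick a scratch device you can overwrite */
	int fd = open("/dev/nvme0n1", O_WRONLY | O_DIRECT);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (posix_memalign(&buf, 4096, 4096))
		return 1;
	memset(buf, 0xab, 4096);

	/*
	 * With a 16K logical block size this 4K write should fail with
	 * EINVAL; with a 4K logical block size (even under a 16K physical
	 * sector) it succeeds and any RMW happens inside the device.
	 */
	if (pwrite(fd, buf, 4096, 0) < 0)
		fprintf(stderr, "pwrite: %s\n", strerror(errno));

	free(buf);
	close(fd);
	return 0;
}

So if only the physical sector is 16K, 4K writes keep working, but the
read-modify-write just moves into the device.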
I hope we will have the opportunity to choose between devices that support 4K and
16K physical sector sizes. But, technically speaking, a storage device usually
receives multiple I/O requests at the same time. Even if it receives 4K updates for
different LBAs, it can combine several of those 4K updates into one 16K NAND flash
page. The question is how to map the updated LBAs to physical locations efficiently,
because mapping (LBAs into erase blocks, for example) is the FTL’s main
responsibility.
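
A toy illustration of that combining idea (purely illustrative; a real FTL also
tracks invalidated slots, garbage collection and wear leveling, all omitted here):

#include <stdio.h>

#define SLOTS_PER_PAGE	4	/* 16K NAND page / 4K update */
#define NR_LBAS		1024

struct map_entry {
	int nand_page;
	int slot;
};

static struct map_entry l2p[NR_LBAS];	/* logical-to-physical table */
static int cur_page, cur_slot;

/* place a 4K update for @lba into the currently open 16K NAND page */
static void ftl_write_4k(int lba)
{
	l2p[lba].nand_page = cur_page;
	l2p[lba].slot = cur_slot;
	if (++cur_slot == SLOTS_PER_PAGE) {
		/* page is full: program it to flash, open the next one */
		cur_slot = 0;
		cur_page++;
	}
}

int main(void)
{
	int lbas[] = { 7, 300, 8, 511 };	/* unrelated 4K updates */

	for (int i = 0; i < 4; i++)
		ftl_write_4k(lbas[i]);

	for (int i = 0; i < 4; i++)
		printf("LBA %d -> NAND page %d, slot %d\n", lbas[i],
		       l2p[lbas[i]].nand_page, l2p[lbas[i]].slot);
	return 0;
}

All four unrelated 4K updates end up packed into NAND page 0, one per slot; only
the translation table needs to remember where each LBA went.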

Thanks,
Slava.



