[PATCH v6 00/10] block atomic writes

Luis Chamberlain mcgrof at kernel.org
Thu Apr 11 12:07:40 PDT 2024


On Wed, Apr 10, 2024 at 09:34:36AM +0100, John Garry wrote:
> On 08/04/2024 18:50, Luis Chamberlain wrote:
> > I agree that when you don't set the sector size to 16k you are not forcing the
> > filesystem to use 16k IOs, the metadata can still be 4k. But when you
> > use a 16k sector size, the 16k IOs should be respected by the
> > filesystem.
> > 
> > Do we break BIOs to below a min order if the sector size is also set to
> > 16k? I haven't seen that and it's unclear when or how that could happen.
> 
> AFAICS, the only guarantee is to not split below LBS.

It would be odd to split a BIO given an inode size requirement spelled
out, but indeed I don't recall verifying this guarantee.

> > At least for NVMe we don't need to yell to a device to inform it we want
> > a 16k IO issued to it to be atomic, if we read that it has the
> > capability for it, it just does it. The IO verification can be done with
> > blkalgn [0].
> > 
> > Does SCSI *require* 16k atomic prep work, or can it be done implicitly?
> > Does it need WRITE_ATOMIC_16?
> 
> physical block size is what we can implicitly write atomically.

Yes, and also on flash to avoid read-modify-writes.

> So if you
> have a 4K PBS and 512B LBS, then WRITE_ATOMIC_16 would be required to write
> 16KB atomically.

Ugh. Why does SCSI require a special command for this?

Now we know what would be needed to bump the physical block size. It is
certainly a different feature, but I think it would be good to
evaluate that world too. For NVMe we don't have such special write
requirements.
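
As a rough illustration of the rule John describes above (only a write
that fits within one physical block is implicitly atomic), here is a
minimal Python sketch. The function name, the return labels and the
alignment checks are assumptions made for illustration; this is a toy
model, not kernel or SCSI behavior.

```python
def classify_atomic_write(write_bytes: int, offset: int,
                          logical_block_size: int,
                          physical_block_size: int) -> str:
    """Toy model: return 'invalid', 'implicit' or 'explicit'.

    'implicit' means the write fits inside a single physical block, so
    (by assumption) the device writes it atomically with a plain write.
    'explicit' means it spans physical blocks, so an explicit atomic
    command (e.g. WRITE_ATOMIC_16 on SCSI) would be needed.
    """
    # A write has to cover whole logical blocks at an LBS-aligned offset.
    if write_bytes <= 0 or offset < 0:
        return "invalid"
    if write_bytes % logical_block_size or offset % logical_block_size:
        return "invalid"

    # Index of the first and last physical block touched by the write.
    first_pb = offset // physical_block_size
    last_pb = (offset + write_bytes - 1) // physical_block_size

    return "implicit" if first_pb == last_pb else "explicit"
```

With a 4K PBS and 512B LBS, a 16KB write at offset 0 is classified as
"explicit", while the same write on a device with a 16K PBS and 16K LBS
is classified as "implicit".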

I put together this kludge with the latest LBS patch series + the
bdev cache aops stuff (which as I said before needs an alternative
solution) and just the scsi atomics topology + physical block size
change, to easily experiment and see what would break:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20240408-lbs-scsi-kludge

Using a larger sector size works but it does not use the special scsi
atomic write.
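
For anyone wanting to check what a given device advertises while
experimenting, a small sketch that reads the relevant queue limits from
sysfs might look like the following. The helper name and the
sysfs_root parameter are hypothetical. The atomic_write_* attribute
names are assumed to match what this series exposes, and they will be
absent on kernels without it, in which case the sketch reports None.

```python
from pathlib import Path

# Queue attributes of interest. The atomic_write_* names are an
# assumption based on this patch series and may not exist on your kernel.
QUEUE_ATTRS = (
    "logical_block_size",
    "physical_block_size",
    "atomic_write_unit_min_bytes",
    "atomic_write_unit_max_bytes",
)


def read_queue_limits(disk: str, sysfs_root: str = "/sys/block") -> dict:
    """Return {attribute: int or None} for <sysfs_root>/<disk>/queue."""
    queue = Path(sysfs_root) / disk / "queue"
    limits = {}
    for attr in QUEUE_ATTRS:
        try:
            limits[attr] = int((queue / attr).read_text().strip())
        except (OSError, ValueError):
            # Attribute missing, unreadable or not a number.
            limits[attr] = None
    return limits
```

For example, read_queue_limits("sda") returns a dict keyed by the
attribute names above, which makes it easy to compare the logical and
physical block sizes against the advertised atomic write unit limits.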

> > > To me, O_ATOMIC would be required for buffered atomic writes IO, as we want
> > > a fixed-sized IO, so that would mean no mixing of atomic and non-atomic IO.
> > Would using the same min and max order for the inode work instead?
> 
> Maybe, I would need to check further.

I'd be happy to help review too.

  Luis



More information about the Linux-nvme mailing list