[PATCH v6 00/10] block atomic writes
Luis Chamberlain
mcgrof at kernel.org
Mon Apr 8 10:50:47 PDT 2024
On Fri, Apr 05, 2024 at 11:06:00AM +0100, John Garry wrote:
> On 04/04/2024 17:48, Matthew Wilcox wrote:
> > > > The thing is that there's no requirement for an interface as complex as
> > > > the one you're proposing here. I've talked to a few database people
> > > > and all they want is to increase the untorn write boundary from "one
> > > > disc block" to one database block, typically 8kB or 16kB.
> > > >
> > > > So they would be quite happy with a much simpler interface where they
> > > > set the inode block size at inode creation time,
> > > We want to support untorn writes for bdev file operations - how can we set
> > > the inode block size there? Currently it is based on logical block size.
> > ioctl(BLKBSZSET), I guess? That currently limits to PAGE_SIZE, but I
> > think we can remove that limitation with the bs>PS patches.
I can say a bit more on this, as I explored it. Essentially Matthew,
yes, I got that to work, but it requires a set of different patches. We
have what we tried, and based on feedback from Chinner we have a
direction on what to try next. The last effort on that front was having
the iomap aops used for the bdev and lifting the PAGE_SIZE limit up to
the page cache limits. The crux there was that we end up requiring
BUFFER_HEAD to be disabled, and that is pretty limiting, so my old
implementation had dynamic aops to let us use the buffer-head aops only
for filesystems which require them and the iomap aops otherwise. But as
Chinner noted, we learned through the DAX experience that that's not a
route we want to try again, so the real solution is to extend the iomap
bdev aops code with buffer-head compatibility.
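For completeness, the knob Matthew refers to above is the existing
BLKBSZSET ioctl. A minimal userspace sketch of using it, assuming the
PAGE_SIZE cap eventually gets lifted by the LBS work, would be:

/*
 * Sketch: bump the bdev block size with BLKBSZSET (needs CAP_SYS_ADMIN).
 * Today the kernel caps this at PAGE_SIZE; the LBS work would lift that.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
        int bsz = argc > 2 ? atoi(argv[2]) : 16384;
        int fd;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (ioctl(fd, BLKBSZSET, &bsz))
                perror("BLKBSZSET"); /* e.g. EINVAL while the cap is in place */
        close(fd);
        return 0;
}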
> We want a consistent interface for bdev and regular files, so that would
> need to work for FSes also. FSes(XFS) work based on a homogeneous inode
> blocksize, which is the SB blocksize.
There are two aspects to this, and it is important to differentiate
them:

1) the LBA format used
2) when a large atomic is supported and you want to use a smaller LBA
   format

When the LBA format, and so the logical block size, is say 16k, the LBS
patches together with the above mentioned patches enable IOs to the
block device to be atomic at 16k.
But to remain flexible we want to support a world where 512-byte and 4k
LBA formats are still used, and you *optionally* want to leverage say
16k atomics. Today's block device topology enables this only with a
knob that lets userspace override the sector size for the filesystem.
In practice today, if you want to use 4k IOs you just format the drive
with a 4k LBA format. However, an alternative, at least for NVMe today,
is to support say a 16k atomic with a 4k or 512-byte LBA format. This
essentially *lifts* the physical block size to 16k while keeping the
logical block size at the LBA format, so 4k or 512 bytes. What you
*could* do with this, from the userspace side of things, is at mkfs
time *opt* in to a larger sector size, up to the physical block size.
When you do this the block device still has a logical block size of the
LBA format, but all IOs the filesystem issues use the larger sector
size you opted in for.
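To make the relationship concrete, here is a small sketch that just
reads both sizes back with the existing ioctls; on such a device the
logical block size stays at the LBA format while the physical block
size is the lifted one, and the latter is what mkfs could opt in to as
its sector size:

/* Sketch: read back the logical and physical block sizes of a bdev. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
        int lbs = 0;
        unsigned int pbs = 0;
        int fd;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
                return 1;
        ioctl(fd, BLKSSZGET, &lbs);  /* logical block size (LBA format) */
        ioctl(fd, BLKPBSZGET, &pbs); /* physical block size, e.g. lifted to 16k */
        printf("logical=%d physical=%u\n", lbs, pbs);
        close(fd);
        return 0;
}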
I suspect this is a use case where perhaps the max folio order could be
set for the bdev in the future, with the logical block size giving the
min order and the large atomic the max order.
> Furthermore, we would seem to be mixing different concepts here. Currently
> in Linux we say that a logical block size write is atomic. In the block
> layer, we split BIOs on LBS boundaries. iomap creates BIOs based on LBS
> boundaries. But writing a FS block is not always guaranteed to be atomic, as
> far as I'm concerned.
True. To be clear, the above paragraph refers to LBS as the logical
block size. However, when a filesystem sets the min order, it should be
respected. I agree that when you don't set the sector size to 16k you
are not forcing the filesystem to use 16k IOs; the metadata can still
be 4k. But when you use a 16k sector size, the 16k IOs should be
respected by the filesystem.
Do we break BIOs below the min order when the sector size is also set
to 16k? I haven't seen that, and it's unclear when or how that could
happen.
At least for NVMe we don't need to tell the device that we want a 16k
IO issued to it to be atomic; if we read that it has the capability for
it, it just does it. The IO verification can be done with blkalgn [0].
Does SCSI *require* 16k atomic prep work, or can it be done implicitly?
Does it need WRITE_ATOMIC_16?
[0] https://github.com/dagmcr/bcc/tree/blkalgn
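For reference, the capability itself is what this series exports
through the new queue sysfs attributes, so a quick way to check it from
userspace is just to read them. A rough sketch; the attribute names
below are the ones proposed in this series and may change:

/* Sketch: read the advertised atomic write limits from sysfs. */
#include <stdio.h>

static long read_limit(const char *dev, const char *attr)
{
        char path[256];
        long val = -1;
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
        f = fopen(path, "r");
        if (!f)
                return -1;
        if (fscanf(f, "%ld", &val) != 1)
                val = -1;
        fclose(f);
        return val;
}

int main(int argc, char **argv)
{
        const char *dev = argc > 1 ? argv[1] : "nvme0n1";

        printf("atomic_write_unit_min_bytes: %ld\n",
               read_limit(dev, "atomic_write_unit_min_bytes"));
        printf("atomic_write_unit_max_bytes: %ld\n",
               read_limit(dev, "atomic_write_unit_max_bytes"));
        return 0;
}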
> So just increasing the inode block size / FS block size does not
> really change anything, in itself.
If we're breaking up IOs when a min order is set for an inode, that
would need to be looked into, but we're not seeing that.
> > Do untorn writes actually exist in SCSI? I was under the impression
> > nobody had actually implemented them in SCSI hardware.
>
> I know that some SCSI targets actually atomically write data in chunks >
> LBS. Obviously atomic vs non-atomic performance is a moot point there, as
> data is implicitly always atomically written.
>
> We actually have an mysql/innodb port of this API working on such a SCSI
> target.
I suspect IO verification with the above tool would show the same if
you use a filesystem with a larger sector size set too, and you would
not have to make any changes to userspace other than at filesystem
creation, with say the mkfs.xfs params -b size=16k -s size=16k.
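For contrast, the API-based path such a port would use discovers the
limits per file, e.g. with the statx additions from this series. A
rough sketch; the mask and field names are those proposed in the series
and need headers with it applied:

/* Sketch: discover per-file atomic write limits via statx. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
        struct statx stx;

        if (argc < 2)
                return 1;
        if (statx(AT_FDCWD, argv[1], 0, STATX_WRITE_ATOMIC, &stx))
                return 1;
        printf("atomic write unit min/max: %u/%u\n",
               stx.stx_atomic_write_unit_min, stx.stx_atomic_write_unit_max);
        return 0;
}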
> However I am not sure about atomic write support for other SCSI targets.
Good to know!
> > > We saw untorn writes as not being a property of the file or even the inode
> > > itself, but rather an attribute of the specific IO being issued from the
> > > userspace application.
> > The problem is that keeping track of that is expensive for buffered
> > writes. It's a model that only works for direct IO. Arguably we
> > could make it work for O_SYNC buffered IO, but that'll require some
> > surgery.
>
> To me, O_ATOMIC would be required for buffered atomic writes IO, as we want
> a fixed-sized IO, so that would mean no mixing of atomic and non-atomic IO.
Would using the same min and max order for the inode work instead?
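For reference, the per-IO model being discussed is what the series
exposes for direct IO today. Roughly, with RWF_ATOMIC as proposed in
this series (the flag and its headers come from the series, and the
size/alignment must fit the advertised limits):

/*
 * Sketch: a single 16k untorn direct IO write using the per-IO flag
 * proposed in this series.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        struct iovec iov;
        void *buf;
        int fd;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_WRONLY | O_DIRECT);
        if (fd < 0)
                return 1;
        if (posix_memalign(&buf, 16384, 16384))
                return 1;
        memset(buf, 0xAA, 16384);
        iov.iov_base = buf;
        iov.iov_len = 16384;
        /* one naturally aligned write, all-or-nothing within the limits */
        if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) < 0)
                perror("pwritev2");
        close(fd);
        return 0;
}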
Luis