[PATCH v6 00/10] block atomic writes

Fri Apr 5 03:06:00 PDT 2024

On 04/04/2024 17:48, Matthew Wilcox wrote:
>>> The thing is that there's no requirement for an interface as complex as
>>> the one you're proposing here.  I've talked to a few database people
>>> and all they want is to increase the untorn write boundary from "one
>>> disc block" to one database block, typically 8kB or 16kB.
>>>
>>> So they would be quite happy with a much simpler interface where they
>>> set the inode block size at inode creation time,
>> We want to support untorn writes for bdev file operations - how can we set
>> the inode block size there? Currently it is based on logical block size.
> ioctl(BLKBSZSET), I guess?  That currently limits to PAGE_SIZE, but I
> think we can remove that limitation with the bs>PS patches.

We want a consistent interface for bdev and regular files, so that would 
need to work for FSes also. FSes(XFS) work based on a homogeneous inode 
blocksize, which is the SB blocksize.

Furthermore, we would seem to be mixing different concepts here. 
Currently in Linux we say that a logical block size write is atomic. In 
the block layer, we split BIOs on LBS boundaries. iomap creates BIOs 
based on LBS boundaries. But writing a FS block is not always guaranteed 
to be atomic, as far as I'm concerned. So just increasing the inode 
block size / FS block size does not really change anything, in itself.

> 
>>> and then all writes to
>>> that inode were guaranteed to be untorn.  This would also be simpler to
>>> implement for buffered writes.
>> We did consider that. Won't that lead to the possibility of breaking
>> existing applications which want to do regular unaligned writes to these
>> files? We do know that mysql/innodb does have some "compressed" mode of
>> operation, which involves regular writes to the same file which wants untorn
>> writes.
> If you're talking about "regular unaligned buffered writes", then that
> won't break.  If you cross a folio boundary, the result may be torn,
> but if you're crossing a block boundary you expect that.
> 
>> Furthermore, untorn writes in HW are expensive - for SCSI anyway. Do we
>> always want these for such a file?
> Do untorn writes actually exist in SCSI?  I was under the impression
> nobody had actually implemented them in SCSI hardware.

I know that some SCSI targets actually atomically write data in chunks > 
LBS. Obviously atomic vs non-atomic performance is a moot point there, 
as data is implicitly always atomically written.

We actually have an mysql/innodb port of this API working on such a SCSI 
target.

However I am not sure about atomic write support for other SCSI targets.

> 
>> We saw untorn writes as not being a property of the file or even the inode
>> itself, but rather an attribute of the specific IO being issued from the
>> userspace application.
> The problem is that keeping track of that is expensive for buffered
> writes.  It's a model that only works for direct IO.  Arguably we
> could make it work for O_SYNC buffered IO, but that'll require some
> surgery.

To me, O_ATOMIC would be required for buffered atomic writes IO, as we 
want a fixed-sized IO, so that would mean no mixing of atomic and 
non-atomic IO.

Thanks,
John