[PATCH v6 00/10] block atomic writes
John Garry
john.g.garry at oracle.com
Fri Apr 5 03:06:00 PDT 2024
On 04/04/2024 17:48, Matthew Wilcox wrote:
>>> The thing is that there's no requirement for an interface as complex as
>>> the one you're proposing here. I've talked to a few database people
>>> and all they want is to increase the untorn write boundary from "one
>>> disc block" to one database block, typically 8kB or 16kB.
>>>
>>> So they would be quite happy with a much simpler interface where they
>>> set the inode block size at inode creation time,
>> We want to support untorn writes for bdev file operations - how can we set
>> the inode block size there? Currently it is based on logical block size.
> ioctl(BLKBSZSET), I guess? That currently limits to PAGE_SIZE, but I
> think we can remove that limitation with the bs>PS patches.
We want a consistent interface for bdev and regular files, so that would
need to work for FSes also. FSes (XFS) work based on a homogeneous inode
blocksize, which is the SB blocksize.

Furthermore, we would seem to be mixing different concepts here.
Currently in Linux we say that a write of one logical block (LBS) is
atomic. In the block layer, we split BIOs on LBS boundaries. iomap
creates BIOs based on LBS boundaries. But writing a FS block is not
always guaranteed to be atomic, as far as I'm concerned. So just
increasing the inode block size / FS block size does not really change
anything, in itself.
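
To illustrate the point above, here is a minimal sketch (not kernel code)
that counts how many logical blocks a byte range touches. The function
name and parameters are made up for illustration. Under the current
guarantee only each single LBS-sized piece is atomic, so any count above
1 means the write as a whole may be torn.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of logical blocks touched by the byte
 * range [off, off + len). "lbs" is the logical block size in bytes.
 * Returns 0 for an empty range or a zero block size.
 */
uint64_t lbs_blocks_touched(uint64_t off, uint64_t len, uint32_t lbs)
{
    if (len == 0 || lbs == 0)
        return 0;

    uint64_t first = off / lbs;             /* block holding the first byte */
    uint64_t last = (off + len - 1) / lbs;  /* block holding the last byte */

    return last - first + 1;
}
```

For example, a 16kB FS block written to a device with a 512B LBS touches
32 logical blocks, so a larger FS block size alone gives no untorn
guarantee for that write.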
>
>>> and then all writes to
>>> that inode were guaranteed to be untorn. This would also be simpler to
>>> implement for buffered writes.
>> We did consider that. Won't that lead to the possibility of breaking
>> existing applications which want to do regular unaligned writes to these
>> files? We do know that mysql/innodb does have some "compressed" mode of
>> operation, which involves regular writes to the same file which wants untorn
>> writes.
> If you're talking about "regular unaligned buffered writes", then that
> won't break. If you cross a folio boundary, the result may be torn,
> but if you're crossing a block boundary you expect that.
>
>> Furthermore, untorn writes in HW are expensive - for SCSI anyway. Do we
>> always want these for such a file?
> Do untorn writes actually exist in SCSI? I was under the impression
> nobody had actually implemented them in SCSI hardware.
I know that some SCSI targets actually write data atomically in chunks
larger than the LBS. Obviously atomic vs non-atomic performance is a moot
point there, as data is implicitly always atomically written.

We actually have a mysql/innodb port of this API working on such a SCSI
target.

However, I am not sure about atomic write support for other SCSI targets.
>
>> We saw untorn writes as not being a property of the file or even the inode
>> itself, but rather an attribute of the specific IO being issued from the
>> userspace application.
> The problem is that keeping track of that is expensive for buffered
> writes. It's a model that only works for direct IO. Arguably we
> could make it work for O_SYNC buffered IO, but that'll require some
> surgery.
To me, O_ATOMIC would be required for buffered atomic write IO, as we
want a fixed-size IO, so that would mean no mixing of atomic and
non-atomic IO.
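
As a rough sketch of what "fixed-size IO" would imply, assuming a single
per-file atomic unit: every IO would have to be exactly one unit long and
start on a unit boundary. The helper name and the "unit" parameter are
hypothetical and not part of any posted patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: accept an IO only if it is exactly one atomic
 * unit long and starts on a unit boundary. "unit" stands in for the
 * fixed per-file size (for example a 16kB database block); it is an
 * assumption made for this sketch.
 */
bool fixed_unit_io_ok(uint64_t off, uint64_t len, uint64_t unit)
{
    if (unit == 0)
        return false;

    return len == unit && off % unit == 0;
}
```

With a 16kB unit, a 4kB write at offset 0 or a 16kB write at offset 4kB
would both be rejected, which is the "no mixing" restriction.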
Thanks,
John