[PATCH 17/21] fs: xfs: iomap atomic write support

John Garry john.g.garry at oracle.com
Tue Dec 5 03:09:46 PST 2023


On 05/12/2023 04:55, Theodore Ts'o wrote:
>> AFAICS, this is without any kernel changes, so no guarantee of unwanted
>> splitting or merging of bios.
> Well, more than one company has audited the kernel paths, and it turns
> out that for selected Kernel versions, after doing desk-check
> verification of the relevant kernel baths, as well as experimental
> verification via testing to try to find torn writes in the kernel, we
> can make it safe for specific kernel versions which might be used in
> hosted MySQL instances where we control the kernel, the mysql server,
> and the emulated block device (and we know the database is doing
> Direct I/O writes --- this won't work for PostgreSQL).  I gave a talk
> about this at Google I/O Next '18, five years ago[1].
> 
> [1]https://urldefense.com/v3/__https://www.youtube.com/watch?v=gIeuiGg-_iw__;!!ACWV5N9M2RV99hQ!I4iRp4xUyzAT0UwuEcnUBBCPKLXFKfk5FNmysFbKcQYfl0marAll5xEEVyB5mMFDqeckCWLmjU1aCR2Z$  
> 
> Given the performance gains (see the talk (see the comparison of the
> at time 19:31 and at 29:57) --- it's quite compelling.
> 
> Of course, I wouldn't recommend this approach for a naive sysadmin,
> since most database adminsitrators won't know how to audit kernel code
> (see the discussion at time 35:10 of the video), and reverify the
> entire software stack before every kernel upgrade.

Sure

>  The challenge is
> how to do this safely.

Right, and that is why I would be concerned about advertising torn-write 
protection support, but someone has not gone through the effort of 
auditing and verification phase to ensure that this does not happen in 
their software stack ever.

> 
> The fact remains that both Amazon's EBS and Google's Persistent Disk
> products are implemented in such a way that writes will not be torn
> below the virtual machine, and the guarantees are in fact quite a bit
> stronger than what we will probably end up advertising via NVMe and/or
> SCSI.  It wouldn't surprise me if this is the case (or could be made
> to be the case) For Oracle Cloud as well.
> 
> The question is how to make this guarantee so that the kernel knows
> when various cloud-provided block devicse do provide these greater
> guarantees, and then how to make it be an architected feature, as
> opposed to a happy implementation detail that has to be verified at
> every kernel upgrade.

The kernel can only judge atomic write support from what the HW product 
data tells us, so cloud-provided block devices need to provide that 
information as best possible if emulating the some storage technology.

Thanks,
John




More information about the Linux-nvme mailing list