[PATCH v2 00/16] block atomic writes

Wed Dec 13 01:32:06 PST 2023

On 12/12/2023 16:32, Christoph Hellwig wrote:
> On Tue, Dec 12, 2023 at 11:08:28AM +0000, John Garry wrote:
>> Two new fields are added to struct statx - atomic_write_unit_min and
>> atomic_write_unit_max. For each atomic individual write, the total length
>> of a write must be a between atomic_write_unit_min and
>> atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
>> at a natural offset in the file wrt the write length.
>>
>> SCSI sd.c and scsi_debug and NVMe kernel support is added.
>>
>> Some open questions:
>> - How to make API extensible for when we have no HW support? In that case,
>>    we would prob not have to follow rule of power-of-2 length et al.
>>    As a possible solution, maybe we can say that atomic writes are
>>    supported for the file via statx, but not set unit_min and max values,
>>    and this means that writes need to be just FS block aligned there.
> I don't think the power of two length is much of a problem to be
> honest, and if we every want to lift it we can still do that easily
> by adding a new flag or limit.

ok, but it would be nice to have some idea on what that flag or limit 
change would be.

> 
> What I'm a lot more worried about is how to tell the file system that
> allocations are done right for these requirement.  There is no way
> a user can know that allocations in an existing file are properly
> aligned, so atomic writes will just fail on existing files.
> 
> I suspect we need an on-disk flag that forces allocations to be
> aligned to the atomic write limit, in some ways similar how the
> XFS rt flag works.  You'd need to set it on an empty file, and all
> allocations after that are guaranteed to be properly aligned.

Hmmm... so how is this different to the XFS forcealign feature?

For XFS, I thought that your idea was to always CoW new extents for 
misaligned extents or writes which spanned multiple extents.

> 
>> - For block layer, should atomic_write_unit_max be limited by
>>    max_sectors_kb? Currently it is not.
> Well.  It must be limited to max_hw_sectors to actually work.

Sure, as max_hw_sectors may also be limited by DMA controller max 
mapping size.

> max_sectors is a software limit below that, which with modern hardware
> is actually pretty silly and a real performance issue with todays
> workloads when people don't tweak it..

Right, so we should limit atomic write queue limits to max_hw_sectors. 
But people can still tweak max_sectors, and I am inclined to say that 
atomic_write_unit_max et al should be (dynamically) limited to 
max_sectors also.

> 
>> - How to improve requirement that iovecs are PAGE-aligned.
>>    There are 2x issues:
>>    a. We impose this rule to not split BIOs due to virt boundary for
>>       NVMe, but there virt boundary is 4K (and not PAGE size, so broken for
>>       16K/64K pages). Easy solution is to impose requirement that iovecs
>>       are 4K-aligned.
>>    b. We don't enforce this rule for virt boundary == 0, i.e. SCSI
> .. we require any device that wants to support atomic writes to not
> have that silly limit.  For NVMe that would require SGL support
> (and some driver changes I've been wanting to make for long where
> we always use SGLs for transfers larger than a single PRP if supported)

If we could avoid dealing with a virt boundary, then that would be nice.

Are there any patches yet for the change to always use SGLs for 
transfers larger than a single PRP?

On a related topic, I am not sure about how - or if we even should - 
enforce iovec PAGE-alignment or length; rather, the user could just be 
advised that iovecs must be PAGE-aligned and min PAGE length to achieve 
atomic_write_unit_max.

> 
> 
>> - Since debugging torn-writes due to unwanted kernel BIO splitting/merging
>>    would be horrible, should we add some kernel storage stack software
>>    integrity checks?
> Yes, I think we'll need asserts in the drivers.  At least for NVMe I
> will insist on them. 

Please see 16/16 in this series.

> For SCSI I think the device actually checks
> because the atomic writes are a different command anyway, or am I
> misunderstanding how SCSI works here?

Right, for WRITE ATOMIC (16) the target will reject a command when it 
does not respect the device atomic write limits.

Thanks,
John