[PATCH v2 00/16] block atomic writes
John Garry
john.g.garry at oracle.com
Wed Dec 13 01:32:06 PST 2023
On 12/12/2023 16:32, Christoph Hellwig wrote:
> On Tue, Dec 12, 2023 at 11:08:28AM +0000, John Garry wrote:
>> Two new fields are added to struct statx - atomic_write_unit_min and
>> atomic_write_unit_max. For each atomic individual write, the total length
>> of a write must be a between atomic_write_unit_min and
>> atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
>> at a natural offset in the file wrt the write length.
>>
>> SCSI sd.c and scsi_debug and NVMe kernel support is added.
>>
>> Some open questions:
>> - How to make API extensible for when we have no HW support? In that case,
>> we would prob not have to follow rule of power-of-2 length et al.
>> As a possible solution, maybe we can say that atomic writes are
>> supported for the file via statx, but not set unit_min and max values,
>> and this means that writes need to be just FS block aligned there.
> I don't think the power of two length is much of a problem to be
> honest, and if we every want to lift it we can still do that easily
> by adding a new flag or limit.
ok, but it would be nice to have some idea on what that flag or limit
change would be.
>
> What I'm a lot more worried about is how to tell the file system that
> allocations are done right for these requirement. There is no way
> a user can know that allocations in an existing file are properly
> aligned, so atomic writes will just fail on existing files.
>
> I suspect we need an on-disk flag that forces allocations to be
> aligned to the atomic write limit, in some ways similar how the
> XFS rt flag works. You'd need to set it on an empty file, and all
> allocations after that are guaranteed to be properly aligned.
Hmmm... so how is this different to the XFS forcealign feature?
For XFS, I thought that your idea was to always CoW new extents for
misaligned extents or writes which spanned multiple extents.
>
>> - For block layer, should atomic_write_unit_max be limited by
>> max_sectors_kb? Currently it is not.
> Well. It must be limited to max_hw_sectors to actually work.
Sure, as max_hw_sectors may also be limited by DMA controller max
mapping size.
> max_sectors is a software limit below that, which with modern hardware
> is actually pretty silly and a real performance issue with todays
> workloads when people don't tweak it..
Right, so we should limit atomic write queue limits to max_hw_sectors.
But people can still tweak max_sectors, and I am inclined to say that
atomic_write_unit_max et al should be (dynamically) limited to
max_sectors also.
>
>> - How to improve requirement that iovecs are PAGE-aligned.
>> There are 2x issues:
>> a. We impose this rule to not split BIOs due to virt boundary for
>> NVMe, but there virt boundary is 4K (and not PAGE size, so broken for
>> 16K/64K pages). Easy solution is to impose requirement that iovecs
>> are 4K-aligned.
>> b. We don't enforce this rule for virt boundary == 0, i.e. SCSI
> .. we require any device that wants to support atomic writes to not
> have that silly limit. For NVMe that would require SGL support
> (and some driver changes I've been wanting to make for long where
> we always use SGLs for transfers larger than a single PRP if supported)
If we could avoid dealing with a virt boundary, then that would be nice.
Are there any patches yet for the change to always use SGLs for
transfers larger than a single PRP?
On a related topic, I am not sure about how - or if we even should -
enforce iovec PAGE-alignment or length; rather, the user could just be
advised that iovecs must be PAGE-aligned and min PAGE length to achieve
atomic_write_unit_max.
>
>
>> - Since debugging torn-writes due to unwanted kernel BIO splitting/merging
>> would be horrible, should we add some kernel storage stack software
>> integrity checks?
> Yes, I think we'll need asserts in the drivers. At least for NVMe I
> will insist on them.
Please see 16/16 in this series.
> For SCSI I think the device actually checks
> because the atomic writes are a different command anyway, or am I
> misunderstanding how SCSI works here?
Right, for WRITE ATOMIC (16) the target will reject a command when it
does not respect the device atomic write limits.
Thanks,
John
More information about the Linux-nvme
mailing list